Abstract

Performing statistical inference in high-dimensional models is an outstanding challenge. A major
source of difficulty is the absence of precise information on the distribution of high-dimensional
regularized estimators.

Here, we consider linear regression in the high-dimensional regime $p \gg n$ and the Lasso
estimator. In this context, we would like to perform inference on a high-dimensional parameters
vector $\theta^* \in \mathbb{R}^p$. Important progress has been achieved in computing confidence intervals and
p-values for single coordinates $\theta^*_i$, $i \in \{1,\dots,p\}$. A key role in these new inferential methods is
played by a certain debiased (or de-sparsified) estimator $\hat\theta^{\rm d}$ that is constructed from the Lasso
estimator. Earlier work establishes that, under suitable assumptions on the design matrix, the
coordinates of $\hat\theta^{\rm d}$ are asymptotically Gaussian provided the true parameters vector $\theta^*$ is $s_0$-sparse
with $s_0 = o(\sqrt{n}/\log p)$.

The condition $s_0 = o(\sqrt{n}/\log p)$ is considerably stronger than the one required for consistent
estimation, namely $s_0 = o(n/\log p)$. In this paper, we consider Gaussian designs with known
or unknown population covariance. When the covariance is known, we prove that the debiased
estimator is asymptotically Gaussian under the nearly optimal condition $s_0 = o(n/(\log p)^2)$. Note
that earlier work was limited to $s_0 = o(\sqrt{n}/\log p)$ even for perfectly known covariance.

The same conclusion holds if the population covariance is unknown but can be estimated
sufficiently well, e.g. under the same sparsity conditions on the inverse covariance as assumed
by earlier work. For intermediate regimes, we describe the trade-off between sparsity in the
coefficients $\theta^*$, and sparsity in the inverse covariance of the design. We further discuss several other
applications of our results to high-dimensional inference. In particular, we propose a thresholded
Lasso estimator that is minimax optimal up to a factor $1 + o_n(1)$ for i.i.d. Gaussian designs.
1 Introduction


1.1 Background 

Consider the random design model where we are given $n$ i.i.d. pairs $(y_1,x_1), (y_2,x_2), \dots, (y_n,x_n)$ with
$y_i \in \mathbb{R}$ and $x_i \in \mathbb{R}^p$. The response variable $y_i$ is a linear function of $x_i$, contaminated by noise $w_i$
independent of $x_i$:

$y_i = \langle \theta^*, x_i \rangle + w_i, \qquad w_i \sim \mathsf{N}(0,\sigma^2).$ (1)

Here $\theta^* \in \mathbb{R}^p$ is a vector of parameters to be estimated and $\langle \cdot,\cdot \rangle$ is the standard scalar product.

In matrix form, letting $y = (y_1,\dots,y_n)^{\sf T}$ and denoting by $X$ the matrix with rows $x_1^{\sf T},\dots,x_n^{\sf T}$, we
have

$y = X\theta^* + w, \qquad w \sim \mathsf{N}(0,\sigma^2 I_n).$ (2)


We are interested in the high-dimensional regime wherein the number of parameters $p$ exceeds
the sample size $n$. Over the last 20 years, impressive progress has been made in developing and
understanding highly effective estimators in this regime [CT07, BRT09, BvdG11]. A prominent
approach is the Lasso [Tib96, CD95], defined through the following convex optimization problem:

$\hat\theta^{\rm Lasso}(y,X;\lambda) \equiv \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1 \Big\}.$ (3)

(We will omit the arguments of $\hat\theta^{\rm Lasso}(y,X;\lambda)$ whenever clear from the context.)
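To make the optimization problem (3) concrete, the following is a minimal sketch of one standard way to solve it, cyclic coordinate descent with soft thresholding. It assumes NumPy is available; the names `soft_threshold` and `lasso_cd` are hypothetical helpers introduced here for illustration and are not taken from the paper or from any package.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(y, X, lam, n_iter=200):
    """Approximately minimize (1/(2n)) ||y - X theta||_2^2 + lam ||theta||_1
    by cyclic coordinate descent (illustrative, assumes no all-zero column)."""
    n, p = X.shape
    theta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n        # (1/n) ||x_j||^2 for each column j
    resid = y.astype(float).copy()           # residual y - X theta at theta = 0
    for _ in range(n_iter):
        for j in range(p):
            # (1/n) x_j^T (partial residual that excludes coordinate j)
            z = X[:, j] @ resid / n + col_sq[j] * theta[j]
            new = soft_threshold(z, lam) / col_sq[j]
            resid += X[:, j] * (theta[j] - new)   # keep the residual in sync
            theta[j] = new
    return theta

# Toy check: with X = 2 I and n = 4 we have X^T X / n = I, so the minimizer
# is the soft-thresholded vector X^T y / n.
X_demo = 2.0 * np.eye(4)
y_demo = np.array([4.0, -2.0, 0.2, 0.0])
theta_demo = lasso_cd(y_demo, X_demo, lam=0.5)
```

In the toy check the design is orthogonal, so every coordinate update is exact after one sweep; for correlated designs the loop only converges in the limit, and a production solver would add a stopping criterion.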

A far less understood question is how to perform statistical inference in the high-dimensional
setting, for instance computing confidence intervals and p-values for quantities of interest. Progress
in this direction was achieved only over the last couple of years. In particular, several papers
[Büh13, ZZ14, JM14b, VdGBRD14, JM14a] develop methods to compute confidence intervals for
single coordinates of the parameters vector $\theta^*$. More precisely, these methods compute intervals
$J_i(\alpha)$ depending on $y$, $X$, of nearly minimal size, with the coverage guarantee

$\mathbb{P}(\theta^*_i \in J_i(\alpha)) \ge 1 - \alpha - o_n(1).$ (4)

The $o_n(1)$ term is explicitly characterized, and vanishes along sequences of instances of increasing
dimensions under suitable conditions on the design matrix $X$.

The fundamental idea developed in [ZZ14, JM14b, VdGBRD14, JM14a] is to construct a debiased
(or de-sparsified) estimator that takes the form

$\hat\theta^{\rm d} = \hat\theta^{\rm Lasso} + \frac{1}{n} M X^{\sf T}(y - X\hat\theta^{\rm Lasso}),$ (5)

where $M \in \mathbb{R}^{p \times p}$ is a matrix that is a function of $X$, but not of $y$. While the construction of $M$
varies across different papers, the basic intuition is that $M$ should be a good estimate of the precision
matrix $\Omega \equiv \Sigma^{-1}$, where $\Sigma = \mathbb{E}\{x_1 x_1^{\sf T}\}$ is the population covariance.
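The debiasing step (5) is a single matrix-vector correction once an initial estimate and a matrix $M$ are available. The sketch below assumes NumPy; `debias` is a hypothetical helper name. As a sanity check (in a low-dimensional setting that is not the regime studied in the paper), it uses the fact that choosing $M$ as the inverse of the empirical covariance maps any initial estimate to the least-squares solution.

```python
import numpy as np

def debias(theta_hat, X, y, M):
    """Return theta_hat + (1/n) M X^T (y - X theta_hat), the form of Eq. (5)."""
    n = X.shape[0]
    return theta_hat + M @ X.T @ (y - X @ theta_hat) / n

# Low-dimensional sanity check (n > p), with synthetic data.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.standard_normal(n)

M_emp = np.linalg.inv(X.T @ X / n)               # inverse empirical covariance
theta_d = debias(np.zeros(p), X, y, M_emp)       # start from the zero vector
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

When $p > n$ the empirical covariance is singular, which is why the paper instead takes $M$ to be (an estimate of) the population precision matrix.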

Assume $\theta^*$ is $s_0$-sparse, i.e. it has only $s_0$ non-zero entries. The key result that allows the
construction of confidence intervals in [ZZ14, VdGBRD14, JM14a] is the following (holding under
suitable conditions on the design matrix). If $M$ is 'sufficiently close' to $\Omega$, and the sparsity level is

$s_0 \ll \frac{\sqrt{n}}{\log p},$ (6)

then $\hat\theta^{\rm d}_i$ is approximately Gaussian with mean $\theta^*_i$ and variance of order $\sigma^2/n$.

The condition (6) comes as a surprise, and is somewhat disappointing. Indeed, consistent estimation
using, for instance, the Lasso can be achieved under the much weaker condition $s_0 \ll n/\log p$.
More specifically, in this regime, with high probability [CT07, ZH08, BRT09, YZ10, BvdG11]

$\|\hat\theta^{\rm Lasso} - \theta^*\|_2^2 \le C s_0 \sigma^2 \frac{\log p}{n}.$ (7)

This naturally leads to the following question:

Does the debiased estimator have a Gaussian limit under the weaker condition $s_0 \ll n/\log p$?

Let us emphasize that the key technical challenge here does not lie in the fact that $M$ is not a
good estimate of the precision matrix $\Omega$. Of course, if $M$ is not close to $\Omega$, then $\hat\theta^{\rm d}$ will not have a
Gaussian limit. However, earlier proofs [ZZ14, VdGBRD14, JM14a] cannot establish the Gaussian
limit for $s_0 \gtrsim \sqrt{n}/\log p$, even if $\Omega$ is known and we set $M = \Omega$. Even the idealized case where the
columns of $X$ are known to be independent and identically distributed (i.e. $\Omega = I$) is only understood
in the asymptotic limit $s_0, n, p \to \infty$ with $s_0/p$, $n/p$ having constant limits in $(0,1)$ [JM14b].

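In order to describe the challenge, let us set $M = \Omega$, and recall the common step of the proofs in
[ZZ14, VdGBRD14, JM14a]. Using the definitions (2), (5), we get

$\sqrt{n}(\hat\theta^{\rm d} - \theta^*) = \sqrt{n}(\hat\theta^{\rm Lasso} - \theta^*) + \frac{1}{\sqrt{n}}\Omega X^{\sf T}(X\theta^* + w - X\hat\theta^{\rm Lasso})$
$\qquad\qquad\quad\;\; = \frac{1}{\sqrt{n}}\Omega X^{\sf T} w + \sqrt{n}(\Omega\hat\Sigma - I)(\theta^* - \hat\theta^{\rm Lasso}),$ (8)

where $\hat\Sigma \equiv X^{\sf T}X/n \in \mathbb{R}^{p \times p}$ is the empirical design covariance. Since $w \sim \mathsf{N}(0,\sigma^2 I_n)$, it is easy
to see that the vector $\Omega X^{\sf T} w/\sqrt{n}$ has Gaussian entries of variance of order one. In order for $\hat\theta^{\rm d}$ to be
approximately Gaussian, we need the second term (which can be interpreted as a bias) to vanish.
Earlier papers [ZZ14, VdGBRD14, JM14a] address this by a simple $\ell_1$-$\ell_\infty$ bound. Namely (denoting
by $|Q|_\infty$ the maximum absolute value of any entry of matrix $Q$):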


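$\|\sqrt{n}(\Omega\hat\Sigma - I)(\theta^* - \hat\theta^{\rm Lasso})\|_\infty \le \sqrt{n}\,|\Omega\hat\Sigma - I|_\infty\,\|\theta^* - \hat\theta^{\rm Lasso}\|_1$
$\qquad \le \sqrt{n} \times C\sqrt{\frac{\log p}{n}} \times C s_0 \sigma\sqrt{\frac{\log p}{n}} \le C^2 \sigma \frac{s_0 \log p}{\sqrt{n}},$ (9)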


where the bound $|\Omega\hat\Sigma - I|_\infty \le C\sqrt{(\log p)/n}$ follows from standard concentration arguments, and the
bound on $\|\theta^* - \hat\theta^{\rm Lasso}\|_1$ is order-optimal and is proved, for instance, in [BRT09, BvdG11].

This simple argument implies that the debiased estimator is approximately Gaussian if the upper
bound in Eq. (9) is negligible, i.e. if $s_0 = o(\sqrt{n}/\log p)$. We see therefore that this requirement is
not imposed in order to control the error in estimating $\Omega$. It instead follows from the simple $\ell_1$-$\ell_\infty$ bound
even if $\Omega$ is known.
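The decomposition into a noise term and a bias term is a purely algebraic identity, so it can be checked numerically for any initial estimate. The sketch below assumes NumPy and an identity design covariance (so that $\Omega = I$); the function name `noise_and_bias` and the perturbed placeholder estimate are illustrative assumptions, not objects from the paper. It also evaluates the Hölder-type $\ell_1$-$\ell_\infty$ upper bound on the bias.

```python
import numpy as np

def noise_and_bias(theta_hat, theta_star, X, w, Omega):
    """Split sqrt(n) (theta_d - theta_star) into the noise term
    Omega X^T w / sqrt(n) and the bias term
    sqrt(n) (Omega Sigma_hat - I) (theta_star - theta_hat)."""
    n, p = X.shape
    Sigma_hat = X.T @ X / n
    noise = Omega @ X.T @ w / np.sqrt(n)
    bias = np.sqrt(n) * (Omega @ Sigma_hat - np.eye(p)) @ (theta_star - theta_hat)
    return noise, bias

rng = np.random.default_rng(1)
n, p = 40, 60
X = rng.standard_normal((n, p))            # rows ~ N(0, I), hence Omega = I
Omega = np.eye(p)
theta_star = np.zeros(p)
theta_star[:3] = 1.0
w = 0.5 * rng.standard_normal(n)
y = X @ theta_star + w

# Placeholder initial estimate (any vector works for the algebraic identity).
theta_hat = theta_star + 0.05 * rng.standard_normal(p)
theta_d = theta_hat + Omega @ X.T @ (y - X @ theta_hat) / n

noise, bias = noise_and_bias(theta_hat, theta_star, X, w, Omega)
lhs = np.sqrt(n) * (theta_d - theta_star)

# l1-l_infinity bound: max entry of (Omega Sigma_hat - I) times l1 error.
Sigma_hat = X.T @ X / n
holder_bound = (np.sqrt(n) * np.abs(Omega @ Sigma_hat - np.eye(p)).max()
                * np.abs(theta_star - theta_hat).sum())
```

Comparing `np.abs(bias).max()` with `holder_bound` on such simulations gives a feel for how loose the worst-case bound can be when the signs of the two factors are not aligned.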
1.2 Main results


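The above exposition should clarify that the $\ell_1$-$\ell_\infty$ bound is quite conservative. Considering the
$i$-th entry in the bias vector ${\rm bias} \equiv \sqrt{n}(\Omega\hat\Sigma - I)(\theta^* - \hat\theta^{\rm Lasso})$, the $\ell_1$-$\ell_\infty$ bound controls it as
$|{\rm bias}_i| \le \sqrt{n}\,\|(\Omega\hat\Sigma - I)_{i,\cdot}\|_\infty \|\theta^* - \hat\theta^{\rm Lasso}\|_1$. This bound would be accurate only if the signs of the
entries $(\theta^*_j - \hat\theta^{\rm Lasso}_j)$ were aligned to the signs of $(\Omega\hat\Sigma - I)_{i,j}$, $j \in \{1,\dots,p\}$. While intuitively this is quite
unlikely, it is difficult to formalize this intuition. Note that in a random design setting, the terms
$(\Omega\hat\Sigma - I)_{i,j}$ and $\theta^*_j - \hat\theta^{\rm Lasso}_j$ are highly dependent: $\hat\theta^{\rm Lasso}$ is a deterministic function of the random
pair $(X,w)$, while $(\Omega\hat\Sigma - I) = (\Omega X^{\sf T}X/n - I)$ is a function of $X$.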

Our main result overcomes this technical hurdle via a careful analysis of such dependencies. We
follow a leave-one-out proof technique. Roughly speaking, in order to understand the distribution of
the $i$-th coordinate of the debiased estimator $\hat\theta^{\rm d}_i$, we consider a modified problem in which column $i$
is removed from the design matrix $X$. We then study the consequences of adding back this column,
and bound the effect of this perturbation. An outline of this proof strategy is provided in Section
6.1.

We state below a simplified version of our main result, referring to Theorem 3.8 below for a full 
statement, including technical conditions. 

Theorem 1.1 (Known covariance). Consider the linear model (2) where $X$ has independent Gaussian
rows, with zero mean and covariance $\Sigma = \Omega^{-1}$. Assume that $\Sigma$ satisfies the technical conditions
stated in Theorem 3.8. Define the debiased estimator $\hat\theta^{\rm d}$ via Eq. (5) with $M = \Omega$, and
$\hat\theta^{\rm Lasso} = \hat\theta^{\rm Lasso}(y,X;\lambda)$ with $\lambda = 8\sigma\sqrt{(\log p)/n}$.

If $n,p \to \infty$ with $s_0 = o(n/(\log p)^2)$, then we have

$\sqrt{n}(\hat\theta^{\rm d} - \theta^*) = Z + o_P(1), \qquad Z|X \sim \mathsf{N}(0,\sigma^2\Omega\hat\Sigma\Omega).$ (10)

Here $o_P(1)$ is a (random) vector satisfying $\|o_P(1)\|_\infty \to 0$ in probability as $n,p \to \infty$, and $Z|X \sim \mathsf{N}(0,\sigma^2\Omega\hat\Sigma\Omega)$
means that the conditional distribution of $Z$ given $X$ is centered Gaussian, with the
stated covariance.

Remark 1.2. The more complete statement of this result, Theorem 3.8, provides explicit non-asymptotic
bounds on the error term $o_P(1)$. In particular, $\|o_P(1)\|_\infty$ turns out to be of order
$\sqrt{s_0/n}\,(\log p)$ with probability converging to one as $n,p \to \infty$.

Theorem 1.1 raises an important question: Does the Gaussian limit hold even if $M$ is an imperfect
estimate of $\Omega$?

If the precision matrix is sufficiently structured, then it can be reliably estimated from the
design matrix $X$. Both [ZZ14] and [VdGBRD14] assume that $\Omega$ is sparse, and use the node-wise
Lasso to construct an estimate $\hat\Omega$ [MB06]. They then set $M = \hat\Omega$.

We follow the same procedure and generalize Theorem 1.1 to the setting of an unknown,
sparse precision matrix. We state here a simplified version of this result, deferring to Theorem 3.13
for a more technical statement including non-asymptotic probability bounds.

Theorem 1.3 (Unknown covariance). Consider the linear model (2) where $X$ has independent Gaussian
rows with precision matrix $\Omega$, satisfying the technical conditions of Theorem 1.1 (stated in Theorem 3.8).
Define the debiased estimator $\hat\theta^{\rm d}$ via Eq. (5) with $\hat\theta^{\rm Lasso} = \hat\theta^{\rm Lasso}(y,X;\lambda)$, $\lambda = 8\sigma\sqrt{(\log p)/n}$,
and $M = \hat\Omega$ computed through node-wise Lasso (see Section 3.3).
Let $s_\Omega$ be the maximum number of non-zero entries in any row of $\Omega$. If $n,p \to \infty$ with $s_0 = o(n/(\log p)^2)$
and $\min(s_\Omega, s_0) = o(\sqrt{n}/\log p)$, then we have

$\sqrt{n}(\hat\theta^{\rm d} - \theta^*) = Z + o_P(1), \qquad Z|X \sim \mathsf{N}(0,\sigma^2\Omega\hat\Sigma\Omega),$ (11)

where $o_P(1)$ is a (random) vector satisfying $\|o_P(1)\|_\infty \to 0$ in probability as $n,p \to \infty$.

Remark 1.4. As mentioned above, this version of the debiased estimator can be constructed entirely
from data. The only unspecified steps are the choice of the regularization parameter $\lambda$, and the
estimation of the noise level $\sigma$. These can be addressed as in [ZZ14, VdGBRD14, JM14a] without
changes in the sparsity condition; we will further discuss these points below.

Remark 1.5. The sparsity condition $\min(s_\Omega, s_0) = o(\sqrt{n}/\log p)$ nicely illustrates the practical
improvement implied by our more refined analysis. If the sparsity of the precision matrix is larger
than the sparsity of $\theta^*$ (i.e. $s_\Omega \ge s_0$), we recover the condition $s_0 = o(\sqrt{n}/\log p)$ which is assumed in the results
of [ZZ14, VdGBRD14]. (Note that [JM14a] obtain the same condition without sparsity assumption
on $\Omega$.) In this regime, our improved analysis does not bring any advantage, since the bottleneck is
due to the inaccurate estimation of $\Omega$.

On the other hand, if the precision matrix is sparser, we obtain a much weaker condition on the
coefficients $\theta^*$. In particular, if $s_\Omega = o(\sqrt{n}/\log p)$, then the condition on $s_0$ is relaxed into a nearly
optimal condition $s_0 = o(n/(\log p)^2)$.

It is instructive to compare this with the past progress in sparse estimation and compressed
sensing. In that context, earlier work based on incoherence conditions [DH01, DET06] implied
accurate reconstruction from a number of random samples scaling quadratically in the number of
non-zero coefficients. Subsequent progress was based on the restricted isometry property [CRT06, CT07],
and established accurate reconstruction from a linear number of measurements.

1.3 Extensions and applications 

Sample splitting. An alternative approach to avoid the $\ell_1$-$\ell_\infty$ bound in Eq. (9) is to modify the
definition of debiased estimator in Eq. (5), using sample-splitting. Roughly speaking, we can split
the sample into two batches of size $n/2$. One batch is then used to estimate $\hat\theta^{\rm Lasso}$ and the other batch
is used for $y$ and $X$ appearing in Eq. (5) (and possibly for computing $M$).

Appendix H discusses this method in greater detail. This approach is subject to variations due
to the random splitting, and does not make use of half of the response variables. While it
provides a viable alternative, it is not the focus of the present work.
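The splitting idea can be sketched in a few lines. This is only one possible reading of the rough description above, assuming NumPy; the function `split_debias` and its callable arguments `fit` (returning an initial estimate from a batch) and `make_M` (returning a $p \times p$ matrix from a design batch) are hypothetical and are not taken from Appendix H.

```python
import numpy as np

def split_debias(X, y, fit, make_M, seed=0):
    """Estimate on one random half of the sample, debias with the other half.

    fit(X, y)  -> initial estimate (e.g. a Lasso fit), uses the first batch only
    make_M(X)  -> p x p debiasing matrix, here built from the second batch
    """
    n = X.shape[0]
    perm = np.random.default_rng(seed).permutation(n)
    first, second = perm[: n // 2], perm[n // 2:]
    theta_hat = fit(X[first], y[first])
    X2, y2 = X[second], y[second]
    M = make_M(X2)
    # Same correction as the debiased estimator, but on held-out data.
    return theta_hat + M @ X2.T @ (y2 - X2 @ theta_hat) / len(second)

# Low-dimensional illustration with trivial plug-ins (not the paper's choices).
rng = np.random.default_rng(2)
X_s = rng.standard_normal((80, 3))
y_s = X_s @ np.array([0.5, 0.0, -1.0]) + 0.1 * rng.standard_normal(80)
est = split_debias(
    X_s, y_s,
    fit=lambda Xa, ya: np.zeros(Xa.shape[1]),
    make_M=lambda Xb: np.linalg.inv(Xb.T @ Xb / Xb.shape[0]),
)
second_half = np.random.default_rng(0).permutation(80)[40:]
ols_second = np.linalg.lstsq(X_s[second_half], y_s[second_half], rcond=None)[0]
```

Because the initial estimate is independent of the second batch, the bias term no longer couples the estimation error with the design used for the correction, at the price of using only half of the responses in each role.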

Confidence intervals. Theorem 1.3 (and its formal version, Theorem 3.13) allows the construction
of confidence intervals using the same general procedure as in [ZZ14, VdGBRD14, JM14a]. Namely,
we construct the debiasing matrix $M$ from the design matrix $X$, and an estimate $\hat\sigma$ of the noise
level. Then, for a significance level $\alpha \in (0,1)$, we form the following confidence interval for
parameter $\theta^*_i$:

$J_i(\alpha) \equiv [\hat\theta^{\rm d}_i - \delta(\alpha,n),\ \hat\theta^{\rm d}_i + \delta(\alpha,n)],$ (12)

$\delta(\alpha,n) \equiv \Phi^{-1}(1 - \alpha/2)\,\frac{\hat\sigma}{\sqrt{n}}\,(M\hat\Sigma M^{\sf T})_{ii}^{1/2},$ (13)

where $\Phi(x) \equiv \int_{-\infty}^{x} e^{-t^2/2}\,{\rm d}t/\sqrt{2\pi}$ is the Gaussian distribution. Section 3.3 presents a formal analysis
of this procedure. A straightforward generalization also allows one to compute p-values for the null
hypothesis $H_{0,i}: \theta^*_i = 0$.
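The interval in Eqs. (12), (13) only needs the debiased coordinate, the noise estimate and one diagonal entry of $M\hat\Sigma M^{\sf T}$. A minimal sketch using only the Python standard library follows; the function name `confidence_interval` and the argument `v_ii` (standing for $(M\hat\Sigma M^{\sf T})_{ii}$) are hypothetical names chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(theta_d_i, sigma_hat, v_ii, n, alpha=0.05):
    """Two-sided interval [theta_d_i - delta, theta_d_i + delta] with
    delta = Phi^{-1}(1 - alpha/2) * (sigma_hat / sqrt(n)) * sqrt(v_ii)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # standard Gaussian quantile
    delta = z * sigma_hat / sqrt(n) * sqrt(v_ii)
    return theta_d_i - delta, theta_d_i + delta

# Illustrative numbers only: n = 100, unit noise level, v_ii = 1.
lo, hi = confidence_interval(0.3, sigma_hat=1.0, v_ii=1.0, n=100, alpha=0.05)
```

The half-width shrinks at the $1/\sqrt{n}$ rate and does not depend on the sparsity level, which is what makes the intervals of nearly minimal size.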

Noise level and regularization. The construction of the confidence interval $J_i(\alpha)$ in Eqs. (12),
(13) requires a suitable choice of the regularization parameter $\lambda$, and an estimate of the noise level
$\sigma$. The same difficulty was present in [ZZ14, VdGBRD14, JM14a]. The approaches used there (for
instance, using the scaled Lasso [SZ12]) can be followed in the present case as well. Under the
assumptions of Theorem 1.1, the same proofs of [JM14a] show that the additional error due to the
choice of $\lambda$ and $\hat\sigma$ is negligible.

Semi-supervised learning. In some applications, the precision matrix can be estimated more
accurately thanks to additional information. For instance, in semi-supervised learning, the statistician
is given additional samples $\tilde x_1, \tilde x_2, \dots, \tilde x_N \in \mathbb{R}^p$ with the same distribution as the $\{x_i\}_{1\le i\le n}$. For
these 'unlabeled' samples, the response variable is unknown. There are indeed many applications
in which acquiring the response variable is much more challenging than capturing the covariates
[GSZ06], and therefore $N \gg n$ or even $N \gg p$. In this setting, we can estimate $\Omega$ more accurately
from $\{\tilde x_j\}_{1\le j\le N}$, then use this estimate to construct $M$.

Non-Gaussian designs. We expect that generalizations of Theorem 1.1 and Theorem 1.3 should
hold for a broad class of random designs with independent sub-Gaussian rows, although new proof
ideas are required. The main technical challenge in extending the present approach is to generalize the
leave-one-out construction. As discussed in Section 6.1, when studying the effect of modifying column
$i$, we need to account for dependencies between columns. For Gaussian designs, these dependencies
are fully captured by the design covariance $\Sigma$.

Note that the Gaussian assumption holds in the context of estimating Gaussian graphical models.
This is itself a broad topic that has attracted significant interest since the seminal work of [MB06].
Remarkably, recent contributions have shown the utility of debiasing methods in this context [JvdG+15b,
CRZZ15, JvdG15a].

1.4 Organization and contributions 

The rest of the paper presents the following contributions: 

1. Section 3. We state formally our Gaussian limit theorems, and use them to construct valid 
confidence intervals, of nearly optimal size. In particular, our results subsume (and improve) 
all previously known results on the debiased estimator for Gaussian designs. 

2. Section 4. We establish a minimax lower bound on the $\ell_\infty$ norm of the non-Gaussian component
in $\hat\theta^{\rm d}$. This implies that our Gaussian limit theorems cannot be substantially improved.

3. Section 5. Apart from the construction of confidence intervals, our Gaussian limit theorems
have several fundamental implications. We discuss a few examples that we consider particularly
interesting. In particular, we construct a thresholded Lasso estimator that is minimax
optimal up to a factor $(1 + o_n(1))$ (an alternative approach to the same problem was recently
proposed in [SC15]).

Section 2 discusses relations with earlier work in this area. Outlines of the proofs of the main theorems
are given in Section 6 and Section 7 with most of the technical work deferred to appendices.

2 Related work 

A parallel line of research develops methods for performing valid inference after a low-dimensional 
model is selected for fitting high-dimensional data [LTTT14, FST14, TLTT14, CHS15]. The resulting
significance statements are typically conditional on the selected model. In contrast, here we
are interested in classical (unconditional) significance statements: the two approaches are broadly 
complementary. 

The focus of the present paper is assessing statistical significance, such as confidence intervals,
for single coordinates in the parameters vector $\theta^*$ and more generally for small groups of coordinates.
Other inference tasks are also interesting and challenging in high dimension, and were the object of
recent investigations [BEM13, BC+15, JBC15, JS15].

Sample splitting provides a general methodology for inference in high dimension [WR09, MB10].
As mentioned above, sample splitting can also be used to define a modified debiased estimator, see
Appendix H. However, sample splitting techniques typically use only part of the data for inference,
and are therefore sub-optimal. Also, the results depend on the random split of the data.

A method for inference without assumptions on the design matrix was developed in [Mei14]. The
resulting confidence intervals are typically quite conservative.

The debiasing method was developed independently from several points of view [Büh13, ZZ14,
JM14b, VdGBRD14, JM14a]. The present authors were motivated by the AMP analysis of the
Lasso [DMM09, BM11, BM12, BLM15], and by the Gaussian limits that this analysis implies. In
particular, [JM14b] used those techniques to analyze standard Gaussian designs (i.e. the case $\Sigma = I$)
in the asymptotic limit $n, p, s_0 \to \infty$ with $s_0/p$, $n/p$ constant. In this limit, the debiased estimator
was proven to be asymptotically Gaussian provided $s_0 \le C n/\log(p/s_0)$ (for a universal constant
$C$). This sparsity condition is even weaker than the one of Theorem 1.1 (or Theorem 3.8), but the
result of [JM14b] only holds asymptotically. Also, [JM14b] proved Gaussian convergence in a weaker
sense than the one established here, implying coverage of the constructed confidence intervals only
'on average' over the coordinates $i \in \{1,\dots,p\}$.

A non-asymptotic result under weaker sparsity conditions, and for designs with dependent columns,
was proved in [JM13]. However, this only establishes Gaussianity of $\hat\theta^{\rm d}_i$ for most of the coordinates
$i \in \{1,\dots,p\}$. Here we prove a significantly stronger result holding uniformly over $i \in \{1,\dots,p\}$.

Most of the work on statistical inference in high-dimensional models has been focused so far on
linear regression. The debiasing method admits a natural extension to generalized linear models that
was analyzed in [VdGBRD14]. Robustness to model misspecification was studied in [BvdG+15]. An
R package for inference in high dimension that uses the node-wise Lasso is available [DBM+15]. An
R implementation of the method [JM14a] (which does not make sparsity assumptions on $\Omega$) is also
available^1.


^1 See http://web.stanford.edu/~montanar/sslasso/.
3 Main results: Gaussian limit theorems


3.1 General notations 

We use $e_i$ to refer to the $i$-th standard basis element, e.g., $e_1 = (1,0,\dots,0)$. For a vector $v$,
${\rm supp}(v)$ represents the positions of nonzero entries of $v$. Further, ${\rm sign}(v)$ is the vector with entries
${\rm sign}(v)_i = +1$ if $v_i > 0$, ${\rm sign}(v)_i = -1$ if $v_i < 0$, and ${\rm sign}(v)_i = 0$ otherwise. For a matrix $M$ with $p$ columns
and a set of indices $J \subseteq [p]$ we use $M_J$ to denote the submatrix formed by columns in $J$. Likewise,
for a vector $\theta$ and a subset $S$, $\theta_S$ is the restriction of $\theta$ to indices in $S$. For an integer $p \ge 1$, we use
the notation $[p] = \{1,\dots,p\}$ and the shorthand $\sim i$ for the set $[p] \setminus i$. We write $\|v\|_p$ for the standard
$\ell_p$ norm of a vector $v$, i.e., $\|v\|_p = (\sum_i |v_i|^p)^{1/p}$, and $\|v\|_0$ for the number of nonzero entries of $v$. For
a matrix $A \in \mathbb{R}^{m \times n}$, $\|A\|_p$ denotes its $\ell_p$ operator norm; in particular, $\|A\|_\infty = \max_{1\le i\le m}\sum_{j=1}^{n}|A_{ij}|$.
This is to be contrasted with the maximum absolute value of any entry of $A$ that, as mentioned
above, we denote by $|A|_\infty = \max_{i\le m,\, j\le n}|A_{ij}|$. For a matrix $A$, we denote its maximum and minimum
singular values by $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$, respectively. If $A$ is symmetric, $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ are
its maximum and minimum eigenvalues. Finally, for two functions $f(n)$ and $g(n)$, the notation
$f(n) \gg g(n)$ means that $f$ 'dominates' $g$ asymptotically, namely, for every fixed positive $C$, there
exists $n(C)$ such that $f(n) \ge C g(n)$ for $n > n(C)$. We also use $f(n) \lesssim g(n)$ to indicate that $f$ is
'bounded' above by $g$ asymptotically, i.e., $f(n) \le C g(n)$ for some positive constant $C$. The notations
$f(n) \ll g(n)$ and $f(n) = o(g(n))$ are defined analogously, and we use $o_P(\cdot)$ to indicate asymptotic
behavior in probability as the sample size $n$ tends to infinity.
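Since the operator norm $\|A\|_\infty$ and the entrywise maximum $|A|_\infty$ are easy to confuse, the following small NumPy sketch (an illustration added here, with an arbitrary example matrix) computes both for the same matrix.

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

# ||A||_inf: l_infinity operator norm, i.e. the largest absolute row sum.
op_inf = np.abs(A).sum(axis=1).max()

# |A|_inf: largest absolute value of any single entry.
entry_max = np.abs(A).max()

# NumPy's matrix norm with ord=inf is the operator norm, not the entry maximum.
op_inf_numpy = np.linalg.norm(A, np.inf)
```

For any matrix with $n$ columns the two quantities satisfy $|A|_\infty \le \|A\|_\infty \le n\,|A|_\infty$, so they can differ by a factor as large as the number of columns.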

We will use c,C,... to denote generic constants that can vary from one position to the other of 
the paper. 

3.2 Preliminaries 

This section includes some preliminary results that are repeatedly used in our proofs. We start with
some well-known results about the Lasso estimator. For the sake of simplicity, we will often use
$\hat\theta = \hat\theta(y,X;\lambda)$ instead of $\hat\theta^{\rm Lasso}$ to denote the Lasso estimator.

We denote the rows of the design matrix $X$ by $x_1,\dots,x_n \in \mathbb{R}^p$ and its columns by $\tilde x_1,\dots,\tilde x_p \in \mathbb{R}^n$.
The empirical covariance of the design $X$ is defined as $\hat\Sigma = (X^{\sf T}X)/n$. The population covariance
will be denoted by $\Sigma$, and we let $\Omega \equiv \Sigma^{-1}$ be the precision matrix.

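Definition 3.1. Given a symmetric matrix $\hat\Sigma \in \mathbb{R}^{p \times p}$ and a set $S \subseteq [p]$, the corresponding
compatibility constant is defined as

$\phi^2(\hat\Sigma, S) \equiv \min_{\theta \in \mathbb{R}^p} \Big\{ \frac{|S|\,\langle\theta,\hat\Sigma\theta\rangle}{\|\theta_S\|_1^2} : \ \theta \in \mathbb{R}^p,\ \|\theta_{S^c}\|_1 \le 3\|\theta_S\|_1 \Big\}.$ (14)

We say that $\hat\Sigma \in \mathbb{R}^{p \times p}$ satisfies the compatibility condition for the set $S \subseteq [p]$, with constant $\phi_0$, if
$\phi(\hat\Sigma, S) \ge \phi_0$. We say that it holds for the design matrix $X$ if it holds for $\hat\Sigma = X^{\sf T}X/n$.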

It is also useful to recall some notation for the restricted eigenvalue condition, introduced by
Bickel, Ritov and Tsybakov [BRT09]. For an integer $0 < s_0 < p$ and a positive number $L$, define
the set $\mathcal{C}(s_0,L) \subseteq \mathbb{R}^p$ by the following cone constraints:

$\mathcal{C}(s_0,L) \equiv \{\theta \in \mathbb{R}^p : \ \exists S \subseteq [p],\ |S| = s_0,\ \|\theta_{S^c}\|_1 \le L\|\theta_S\|_1\}.$ (15)



In high dimension, the empirical covariance $\hat\Sigma$ is singular. However, we can ask for non-singularity
of $\hat\Sigma$ for vectors in $\mathcal{C}(s_0,L)$. Rudelson and Zhou [RZ13] prove a reduction principle that bounds the
restricted eigenvalues of the empirical covariance in terms of those of the population covariance. We
will use their result specialized to the case of Gaussian matrices.

Lemma 3.2 ([RZ13], Theorem 3.1). Suppose that $\sigma_{\min}(\Sigma) \ge C_{\min} > 0$ and $\sigma_{\max}(\Sigma) \le C_{\max} < \infty$.
Let $X \in \mathbb{R}^{n \times p}$ have independent rows drawn from $\mathsf{N}(0,\Sigma)$. Set $0 < \delta < 1$, $0 < s_0 < p$, and $L > 0$.
Define the following event

$\mathcal{B}_\delta(n,s_0,L) \equiv \Big\{ X \in \mathbb{R}^{n \times p} : \ (1-\delta)\sqrt{C_{\min}} \le \frac{\|Xv\|_2}{\sqrt{n}\,\|v\|_2} \le (1+\delta)\sqrt{C_{\max}}, \ \forall v \in \mathcal{C}(s_0,L) \text{ s.t. } v \ne 0 \Big\}.$ (16)

Then, there exists a constant $c_1 = c_1(L)$ such that, for sample size $n \ge c_1 s_0 \log(p/s_0)$, we have

$\mathbb{P}(\mathcal{B}_\delta(n,s_0,L)) \ge 1 - 2e^{-\delta^2 n}.$ (17)


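Remark 3.3. Fix $S \subseteq [p]$ with $|S| = s_0$. Under the event $\mathcal{B}_\delta(n,s_0,3)$, we have

$\phi^2(\hat\Sigma,S) \ge \min_{\theta \in \mathcal{C}(s_0,3)} \frac{s_0\langle\theta,\hat\Sigma\theta\rangle}{\|\theta_S\|_1^2} \ge \min_{\theta \in \mathcal{C}(s_0,3)} \frac{\langle\theta,\hat\Sigma\theta\rangle}{\|\theta_S\|_2^2} \ge (1-\delta)^2 C_{\min},$

where the second inequality follows from the Cauchy-Schwarz inequality.

We next introduce the event

$\mathcal{B}(n,p) \equiv \Big\{ w \in \mathbb{R}^n : \ \frac{1}{n}\|X^{\sf T}w\|_\infty \le 2\sigma\sqrt{\frac{\log p}{n}} \Big\}.$ (18)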

On $\mathcal{B}(n,p)$ we can control the randomness due to the measurement noise. A well-known union bound
argument shows that $\mathcal{B}(n,p)$ has large probability (see, for instance, [BvdG11]).

Lemma 3.4 ([BvdG11], Lemma 6.2). Suppose that $\hat\Sigma_{ii} \le 1$ for $i \in [p]$. Then we have

$\mathbb{P}(\mathcal{B}(n,p)) \ge 1 - 2p^{-1}.$


The following lemma states that the Lasso estimator is sparse. Its proof is given in Appendix A.

Lemma 3.5. Consider the Lasso selector $\hat\theta$ with $\lambda = \kappa\sigma\sqrt{\log p/n}$, for a constant $\kappa \ge 8$. On the
event $\mathcal{B} \equiv \mathcal{B}(n,p) \cap \mathcal{B}_\delta(n,s_0,3)$, the following holds:

$|\hat S| \le C_* s_0,$ (19)

with

$C_* \equiv \frac{16\,C_{\max}}{(1-\delta)^2 C_{\min}}.$ (20)

Our next Lemma states a property of Gaussian design matrices which will be used repeatedly in 
our analysis. Its proof is very short and is given here for the reader’s convenience. 

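Lemma 3.6. Let $v_i = X\Omega e_i$. Then $v_i$ and $X_{\sim i}$ are independent.

Proof. Define $u = \Omega e_i$ and fix $j \ne i$. Recall that $\tilde x_\ell$ denotes the $\ell$-th column of $X$. We write
$v_i = \sum_{\ell=1}^{p} u_\ell \tilde x_\ell$ and

$\mathbb{E}(v_i \tilde x_j^{\sf T}) = \sum_{\ell=1}^{p} u_\ell\,\mathbb{E}(\tilde x_\ell \tilde x_j^{\sf T}) = \sum_{\ell=1}^{p} u_\ell \Sigma_{\ell j} I_{n\times n} = \sum_{\ell=1}^{p} \Omega_{i\ell}\Sigma_{\ell j} I_{n\times n} = (\Omega\Sigma)_{ij} I_{n\times n} = 0,$

where the last step holds since $i \ne j$. Since $v_i$ and $\tilde x_j$ are jointly Gaussian, this implies that they are
independent. □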

We finally introduce some parameters that are used in stating our main theorems. For an integer
$k$ and an invertible matrix $A \in \mathbb{R}^{p \times p}$, we define $\rho(A,k)$ as follows:

$\rho(A,k) \equiv \max_{T \subseteq [p],\ |T| \le k} \|A^{-1}_{T,T}\|_\infty,$ (21)

where we adopt the convention $A^{-1}_{T,T} \equiv (A_{T,T})^{-1}$ and recall that $\|\cdot\|_\infty$ denotes the $\ell_\infty$ operator norm
(maximum $\ell_1$ norm of the rows). It is clear that $\rho(A,k)$ is non-decreasing in $k$.

Lemma 3.7. For an invertible matrix $A$, we have

$\rho(A,p) = \|A^{-1}\|_\infty.$ (22)

Lemma 3.7 is proved in Appendix B. As a result of Lemma 3.7 and the non-decreasing property
of $\rho(A,k)$, for any $1 \le k \le p$ we have

$\rho(A,k) \le \rho(A,p) = \|A^{-1}\|_\infty.$ (23)

Another bound on $\rho(A,k)$ is as follows:

$\rho(A,k) \le \max_{T \subseteq [p],\ |T| \le k}\ \max_{j \in [p]} \sqrt{k}\,\|A^{-1}_{T,T}e_j\|_2 \le \max_{T \subseteq [p],\ |T| \le k} \sqrt{k}\,\sigma_{\max}(A^{-1}_{T,T}) \le \frac{\sqrt{k}}{\sigma_{\min}(A)}.$ (24)
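For small $p$ the quantity $\rho(A,k)$ can be evaluated by brute force over all index subsets, which gives a concrete check of its monotonicity in $k$ and of the $\sqrt{k}/\sigma_{\min}(A)$ bound. The sketch below assumes NumPy and a symmetric positive definite example matrix (so that every principal submatrix is invertible); the function name `rho` and the test matrix are illustrative choices, not objects from the paper.

```python
from itertools import combinations
import numpy as np

def rho(A, k):
    """Brute-force max over subsets T with |T| <= k of the l_infinity
    operator norm of (A_{T,T})^{-1}. Exponential cost: small p only."""
    p = A.shape[0]
    best = 0.0
    for size in range(1, k + 1):
        for T in combinations(range(p), size):
            block_inv = np.linalg.inv(A[np.ix_(T, T)])
            best = max(best, np.abs(block_inv).sum(axis=1).max())
    return best

# Symmetric positive definite tridiagonal example.
A_ex = np.array([[2.0, 1.0, 0.0],
                 [1.0, 2.0, 1.0],
                 [0.0, 1.0, 2.0]])
rho_values = [rho(A_ex, k) for k in (1, 2, 3)]
sigma_min = np.linalg.svd(A_ex, compute_uv=False).min()
sqrt_k_bounds = [np.sqrt(k) / sigma_min for k in (1, 2, 3)]
```

The number of subsets grows combinatorially in $p$, so this enumeration is only a didactic device; in the theorems $\rho$ enters as an assumed constant rather than a computed quantity.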

3.3 Statement of main theorems 

In our first theorem, we assume that the precision matrix $\Omega \equiv \Sigma^{-1}$ is available and we set $M = \Omega$. We
prove the corresponding debiased estimator is asymptotically unbiased provided that $n \gg s_0(\log p)^2$.

3.3.1 Known covariance 

Theorem 3.8 (Known covariance). Consider the linear model (2) where $X$ has independent Gaussian
rows, with zero mean and covariance $\Sigma$, and $\theta^*$ is $s_0$-sparse. Suppose that $\Sigma$ satisfies the following
conditions:

(i) For $i \in [p]$, we have $\Sigma_{ii} \le 1$.

(ii) We have $\sigma_{\min}(\Sigma) \ge C_{\min} > 0$ and $\sigma_{\max}(\Sigma) \le C_{\max}$ for some constants $C_{\min}$ and $C_{\max}$.

(iii) Define $C_0 \equiv (32\,C_{\max}/C_{\min}) + 1$. We have $\rho(\Sigma, C_0 s_0) \le \rho$, for some constant $\rho > 0$.

Let $\hat\theta$ be the Lasso estimator defined by (3) with $\lambda = \kappa\sigma\sqrt{(\log p)/n}$, for $\kappa \in [8, \kappa_{\max}]$. Further, let $\hat\theta^{\rm d}$
be defined as per equation (5), with $M = \Omega \equiv \Sigma^{-1}$. Then, there exist constants $c$, $C$ depending solely
on $C_{\min}$, $C_{\max}$ and $\kappa_{\max}$, such that, for $n \ge \max(25\log p,\ c\,s_0\log(p/s_0))$, the following holds true:

$\sqrt{n}(\hat\theta^{\rm d} - \theta^*) = Z + R, \qquad Z|X \sim \mathsf{N}(0,\sigma^2\Omega\hat\Sigma\Omega),$ (25)

IP(p||oo > Cpa^logp'^ < +pe-^/^^^° + 8p-^ , (26) 

with $c_* \equiv C_{\min}/16$.

The proof of this theorem is presented in Section 6.

This theorem states that if the sample size satisfies $n = \Omega(s_0\log p)$, then the maximum size of
the 'bias' $R_i$ over $i \in [p]$ is bounded by

$\|R\|_\infty = O_P\Big(\sqrt{\frac{s_0}{n}}\,\log p\Big).$

On the other hand, each entry of the 'noise term' $Z_i$ has variance $\sigma^2(\Omega\hat\Sigma\Omega)_{ii}$. Applying Lemma 7.2
in [JM13], we have $|\Omega\hat\Sigma\Omega - \Omega|_\infty = o_P(1)$ and thus $\min_{i\in[p]}(\Omega\hat\Sigma\Omega)_{ii} \ge \min_{i\in[p]}\Omega_{ii} - o_P(1)$ is of order
one because $\Omega_{ii} \ge 1/\Sigma_{ii} \ge 1$. Hence, $|R_i|$ is much smaller than $Z_i$ for $n \gg s_0(\log p)^2$. We summarize this
observation in the remark below.


Remark 3.9. (Discussion of the assumptions on $\Sigma$.) Assumption (i) sets the normalization of the
design matrix. Assumption (ii) on the eigenvalues of $\Sigma$ is common in high-dimensional models.
Further, note that by Assumption (ii) and invoking Eq. (24), we have $\rho(\Sigma, C_0 s_0) \le \sqrt{C_0 s_0}/C_{\min}$.
Using this bound for $\rho$ in Eq. (26), we recover the bound $\|R\|_\infty \lesssim s_0\log p/\sqrt{n}$ which is established
in previous work [ZZ14, VdGBRD14, JM14a]. Note that this bound on the bias does not require
Assumption (iii) (namely, that $\rho$ is a bounded constant). However, Theorem 3.8 asserts that, if $\rho$ is
a constant (Assumption (iii)), we have a sharper bound on the bias, namely $\|R\|_\infty \lesssim \sqrt{s_0/n}\,\log p$.

A large family of covariance matrices satisfy the conditions of Theorem 3.8. Examples include block
diagonal matrices where the size of the blocks is bounded, and circulant matrices, where $\Sigma_{ij} = r^{|i-j|}$
for some $r \in (0,1)$.

Corollary 3.10. Under the assumptions of Theorem 3.8, if $s_0 \ll n/(\log p)^2$, then $\hat\theta^{\rm d}$ is asymptotically
normally distributed. More precisely, let $\hat\sigma = \hat\sigma(y,X)$ be an estimator of the noise level satisfying, for any
$\varepsilon > 0$,

$\lim_{n\to\infty}\ \sup_{\theta^* \in \mathbb{R}^p;\ \|\theta^*\|_0 \le s_0} \mathbb{P}\Big( \Big|\frac{\hat\sigma}{\sigma} - 1\Big| \ge \varepsilon \Big) = 0.$ (27)

If $s_0 \ll n/(\log p)^2$, then, for all $x \in \mathbb{R}$, we have

$\lim_{n\to\infty}\ \sup_{\theta^* \in \mathbb{R}^p;\ \|\theta^*\|_0 \le s_0} \Big| \mathbb{P}\Big\{ \frac{\sqrt{n}(\hat\theta^{\rm d}_i - \theta^*_i)}{\hat\sigma\,[\Omega\hat\Sigma\Omega]_{ii}^{1/2}} \le x \Big\} - \Phi(x) \Big| = 0.$ (28)

Armed with a precise distributional characterization of $\hat\theta^{\rm d}$, we can construct asymptotically valid
confidence intervals for each parameter $\theta^*_i$ as per Eqs. (12), (13). Validity of the constructed confidence
intervals requires a consistent estimator of $\sigma$. There are several proposals for such an estimator.
A non-exhaustive list includes [FL01, FL08, SBvdG10, Zha10, SZ12, BC13, RTF13, Dic12, FSW09,
BEM13]. For concreteness, we use the scaled Lasso [SZ12] given by


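$\{\hat\theta, \hat\sigma\} \equiv \arg\min_{\theta \in \mathbb{R}^p,\ \sigma > 0} \Big\{ \frac{1}{2\sigma n}\|y - X\theta\|_2^2 + \frac{\sigma}{2} + \lambda\|\theta\|_1 \Big\}.$ (29)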


The following lemma shows that the scaled Lasso estimate $\hat\sigma$ satisfies the consistency criterion
(27).

Lemma 3.11. Let $\hat\sigma$ be the scaled Lasso estimator of the noise level, see Equation (29), with $\lambda = 10\sqrt{(2\log p)/n}$.
Then $\hat\sigma$ satisfies Equation (27).

We refer to our earlier work [JM14a, Appendix C] for the proof of Lemma 3.11.
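A common way to compute the scaled Lasso of [SZ12] is to alternate between a Lasso fit with penalty $\sigma\lambda$ and the update $\sigma = \|y - X\theta\|_2/\sqrt{n}$. The sketch below assumes the standard joint objective $\|y - X\theta\|_2^2/(2\sigma n) + \sigma/2 + \lambda\|\theta\|_1$ and NumPy; the helper names `_lasso_cd` and `scaled_lasso`, the iteration counts and the simulated data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def _lasso_cd(y, X, lam, theta0, n_iter=200):
    """Coordinate descent for (1/(2n)) ||y - X theta||^2 + lam ||theta||_1,
    warm-started at theta0 (assumes no all-zero column)."""
    n, p = X.shape
    theta = theta0.astype(float).copy()
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ theta
    for _ in range(n_iter):
        for j in range(p):
            z = X[:, j] @ resid / n + col_sq[j] * theta[j]
            new = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (theta[j] - new)
            theta[j] = new
    return theta

def scaled_lasso(y, X, lam, n_outer=20):
    """Alternate: theta <- Lasso with penalty sigma * lam, then
    sigma <- ||y - X theta||_2 / sqrt(n)."""
    n, p = X.shape
    theta = np.zeros(p)
    sigma = np.linalg.norm(y) / np.sqrt(n)      # residual scale at theta = 0
    for _ in range(n_outer):
        theta = _lasso_cd(y, X, sigma * lam, theta)
        sigma = np.linalg.norm(y - X @ theta) / np.sqrt(n)
    return theta, sigma

# Small simulated example (n > p so that lam = 0 reduces to least squares).
rng = np.random.default_rng(0)
X_sl = rng.standard_normal((100, 5))
y_sl = X_sl @ np.array([1.0, 0.0, 0.0, -1.0, 0.0]) + 0.5 * rng.standard_normal(100)
theta_ls, sigma_ls = scaled_lasso(y_sl, X_sl, lam=0.0)
theta_sl, sigma_sl = scaled_lasso(y_sl, X_sl, lam=0.2)
ols_sl = np.linalg.lstsq(X_sl, y_sl, rcond=None)[0]
```

The appeal of this formulation is that the penalty level $\lambda$ no longer needs to be scaled by the unknown $\sigma$: the noise level is estimated jointly with the coefficients.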

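Furthermore, in the context of hypothesis testing, we can test the null hypothesis $H_{0,i}: \theta^*_i = 0$
versus the alternative $H_{A,i}: \theta^*_i \ne 0$. We construct the two-sided p-values

$P_i = 2\Big(1 - \Phi\Big(\frac{\sqrt{n}\,|\hat\theta^{\rm d}_i|}{\hat\sigma\,[\Omega\hat\Sigma\Omega]_{ii}^{1/2}}\Big)\Big).$ (30)

The decision rule follows immediately: we reject $H_{0,i}$ if $P_i \le \alpha$.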
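The two-sided p-value is a one-line computation once the standardized statistic is formed. A minimal standard-library sketch follows; the function name `two_sided_p_value` and the argument `v_ii` (standing for the relevant diagonal entry, $[\Omega\hat\Sigma\Omega]_{ii}$ in the known-covariance case) are hypothetical names for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(theta_d_i, sigma_hat, v_ii, n):
    """P_i = 2 * (1 - Phi(sqrt(n) |theta_d_i| / (sigma_hat * sqrt(v_ii))))."""
    z = sqrt(n) * abs(theta_d_i) / (sigma_hat * sqrt(v_ii))
    return 2.0 * (1.0 - NormalDist().cdf(z))

# Illustrative numbers only: n = 100, unit noise level, v_ii = 1.
p_null = two_sided_p_value(0.0, sigma_hat=1.0, v_ii=1.0, n=100)    # statistic 0
p_edge = two_sided_p_value(0.196, sigma_hat=1.0, v_ii=1.0, n=100)  # statistic 1.96
reject_edge = p_edge <= 0.05
```

Rejecting when $P_i \le \alpha$ is equivalent to checking whether zero falls outside the level-$\alpha$ confidence interval built from the same standard error.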


Remark 3.12. It is worth noting that the sample splitting approach, discussed in Appendix H, does
not require Assumption (iii) in Theorem 3.8. However, as pointed out in the introduction, this approach
suffers from variability due to the random splitting and does not make use of half of the response
variables.


3.3.2 Unknown covariance 


We next generalize our result to the case of unknown covariance, where following [ZZ14, VdGBRD14] 
we construct the debiasing matrix M using node-wise Lasso on matrix X. For reader’s convenience, 
we first describe this construction. 

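For $i \in [p]$, we define the vector $\hat\gamma_i = (\hat\gamma_{i,j})_{j \in [p]\setminus i} \in \mathbb{R}^{p-1}$ by performing sparse regression of the
$i$-th column of $X$ against all the other columns. Formally,

$\hat\gamma_i(\tilde\lambda) = \arg\min_{\gamma \in \mathbb{R}^{p-1}} \Big\{ \frac{1}{2n}\|\tilde x_i - X_{\sim i}\gamma\|_2^2 + \tilde\lambda\|\gamma\|_1 \Big\},$ (31)

where $\tilde x_i$ is the $i$-th column of $X$ and $X_{\sim i}$ is the sub-matrix obtained by removing the $i$-th column
(i.e., keeping the columns indexed by $[p] \setminus i$). Also define

$\hat C = \begin{pmatrix} 1 & -\hat\gamma_{1,2} & \cdots & -\hat\gamma_{1,p} \\ -\hat\gamma_{2,1} & 1 & \cdots & -\hat\gamma_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ -\hat\gamma_{p,1} & -\hat\gamma_{p,2} & \cdots & 1 \end{pmatrix}$ (32)

and let

$\hat T^2 = {\rm diag}(\hat\tau_1^2,\dots,\hat\tau_p^2), \qquad \hat\tau_i^2 = \frac{1}{n}(\tilde x_i - X_{\sim i}\hat\gamma_i)^{\sf T}\tilde x_i.$ (33)

Finally, define $M = M(\tilde\lambda)$ by

$M = \hat T^{-2}\hat C.$ (34)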
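The node-wise construction amounts to $p$ separate Lasso regressions followed by a row rescaling. The sketch below assumes NumPy and uses a basic coordinate-descent Lasso as the inner solver; the names `_lasso_cd` and `nodewise_M`, the symbol `lam_tilde` for the node-wise regularization parameter, and the simulated design are assumptions made for illustration rather than the implementation used in [ZZ14, VdGBRD14].

```python
import numpy as np

def _lasso_cd(y, X, lam, n_iter=300):
    """Coordinate descent for (1/(2n)) ||y - X g||^2 + lam ||g||_1."""
    n, p = X.shape
    g = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            z = X[:, j] @ resid / n + col_sq[j] * g[j]
            new = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (g[j] - new)
            g[j] = new
    return g

def nodewise_M(X, lam_tilde):
    """Build M = T^{-2} C from node-wise Lasso regressions of each column
    of X on all the other columns."""
    n, p = X.shape
    C = np.eye(p)
    tau2 = np.empty(p)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        gamma = _lasso_cd(X[:, i], X[:, others], lam_tilde)
        C[i, others] = -gamma                       # row i of C
        tau2[i] = (X[:, i] - X[:, others] @ gamma) @ X[:, i] / n
    return C / tau2[:, None]                        # divide row i by tau_i^2

# Low-dimensional illustration: with lam_tilde = 0 and n > p, the regressions
# are least squares and M is the inverse of the empirical covariance.
rng = np.random.default_rng(0)
X_nw = rng.standard_normal((200, 4))
Sigma_hat_nw = X_nw.T @ X_nw / 200
M0 = nodewise_M(X_nw, 0.0)
M1 = nodewise_M(X_nw, 0.1)
```

By construction the diagonal of $M\hat\Sigma$ equals one for any value of the regularization parameter, while the off-diagonal entries are controlled by the optimality conditions of the node-wise regressions.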

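Theorem 3.13 (Unknown covariance). Consider the linear model (2) where $X$ has independent
Gaussian rows, with zero mean and covariance $\Sigma$. Suppose that Assumptions (i), (ii), (iii) in Theorem
3.8 hold true for $\Sigma$. We further let $s_\Omega$ be the maximum sparsity of the rows of $\Omega = \Sigma^{-1}$,
i.e.

$s_\Omega \equiv \max_{i \in [p]} |\{j \ne i : \ \Omega_{ij} \ne 0\}|.$ (35)

Let $\hat\theta$ be the Lasso estimator defined by (3) with $\lambda = \kappa\sigma\sqrt{(\log p)/n}$, for $\kappa \in [8, \kappa_{\max}]$, and let $\hat\theta^{\rm d}$ be the
debiased estimator with $M$ given by (34) and $\tilde\lambda = K\sqrt{(\log p)/n}$ (with $K$ a suitably large universal
constant). Suppose that $s_\Omega \ll n/\log p$.

Then, there exist constants $c$, $C$ depending solely on $C_{\min}$, $C_{\max}$, $\kappa_{\max}$ such that, for $n \ge c\,s_0\log p$,
the following holds true:

$\sqrt{n}(\hat\theta^{\rm d} - \theta^*) = Z + R, \qquad Z|X \sim \mathsf{N}(0,\sigma^2\Omega\hat\Sigma\Omega),$ (36)

$\|R\|_\infty \le C\rho\sigma\sqrt{\frac{s_0}{n}}\,\log p + C\sigma\min(s_0, s_\Omega)\frac{\log p}{\sqrt{n}},$ (37)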

with probability at least 1 — 2pe pe ^, for some constants c*, c', c" > 0. 


The proof of Theorem 3.13 is deferred to Section 7. 

A result similar to Corollary 3.10 holds true for the case of unknown covariance. 

Corollary 3.14. Let $\hat\sigma = \hat\sigma(y, X)$ be an estimator of the noise level satisfying Eq. (27) for any $\varepsilon > 0$. Under the assumptions of Theorem 3.8, if $\min(s_0, s_\Omega) \ll \sqrt{n}/\log p$ and $s_0 \ll n/(\rho(\log p)^2)$, then for all $x \in \mathbb{R}$ we have

$$\lim_{n\to\infty}\ \sup_{\theta^*\in\mathbb{R}^p;\ \|\theta^*\|_0\le s_0} \left| P\left\{ \frac{\sqrt{n}(\hat\theta^d_i - \theta^*_i)}{\hat\sigma\,[M\hat\Sigma M^T]_{ii}^{1/2}} \le x \right\} - \Phi(x) \right| = 0, \tag{38}$$

where $M$ is given by equation (34).

Using the above distributional characterization, we can construct confidence intervals for the individual model parameters $\theta^*_i$ as in (12), (13) with $M$ given by (34) and $\hat\sigma$ given by the scaled Lasso as per (29). For the hypothesis testing task, two-sided p-values can be built similarly to (30), where we replace $\Omega\hat\Sigma\Omega$ with $M\hat\Sigma M^T$.

3.4 Numerical illustration 

Our goal in this section is to numerically corroborate the results of Theorem 3.8 and Theorem 3.13. More specifically, we would like to check whether the debiased estimator exhibits an unbiased Gaussian distribution provided that the sample size scales linearly with the number of nonzero parameters.

We generate data from linear model (1) with the following configuration. We fix $p = 3000$ and consider regression parameter $\theta_0$ with support $S_0$ chosen uniformly at random from the index set $[p]$ and $\theta_{0,i} = 0.15$ for $i \in S_0$ and zero otherwise. The design matrix $X$ has i.i.d. rows drawn from $N(0, \Sigma)$, where $\Sigma \in \mathbb{R}^{p\times p}$ is the circulant matrix with entries $\Sigma_{ij} = \ldots$ The measurement noise $w$ has i.i.d. standard normal entries.

Let $s_0 = |S_0|$ and $\varepsilon = s_0/p$ be the sparsity level and $\delta = n/p$ denote the undersampling rate. We vary $\varepsilon$ in the set $\{0.1, 0.15, 0.2, 0.25, 0.3\}$ and for each value of $\varepsilon$ we compute the critical value of $\delta$ above which the debiased estimator admits a Gaussian distribution. We denote this critical value by $\delta_c$ and define it as follows. We vary $\delta$ and for each pair $(\varepsilon, \delta)$, compute the debiased estimator $\hat\theta^d$ (with $M = \Omega$) for 100 realizations of noise $w$. We then compute the empirical kurtosis of each coordinate $T_i = \sqrt{n}(\hat\theta^d_i - \theta^*_i)/(\sigma[M\hat\Sigma M^T]_{ii}^{1/2})$. For $i \in [p]$, let $\gamma_i^\delta$ denote the empirical kurtosis of $T_i$, where we make the dependence on $\delta$ explicit in the notation. Denote by $m(\gamma^\delta)$ and $\mathrm{SD}(\gamma^\delta)$ the mean and the standard deviation of $\gamma^\delta = (\gamma_1^\delta, \dots, \gamma_p^\delta)$, respectively. We further define the standard error $\mathrm{SE}(\gamma^\delta) = \mathrm{SD}(\gamma^\delta)/\sqrt{p}$. We use the one-standard-error rule to decide the value of $\delta_c$. Namely,

$$\delta_c = \arg\min\{\delta \in (0,1),\ \text{s.t. } |m(\gamma^\delta)| \le \mathrm{SE}(\gamma^\delta)\}. \tag{39}$$

Figure 1 corresponds to $\varepsilon = 0.2$. The dots indicate $m(\gamma^\delta)$ and the dotted lines correspond to $m(\gamma^\delta) \pm \mathrm{SE}(\gamma^\delta)$. By the one-standard-error rule, the estimated value of $\delta_c$ works out at $\delta_c = 0.57$.

Figure 2 shows $\delta_c$ versus $\varepsilon$. The black curve corresponds to the case of known covariance, where we set $M = \Omega$, and the red curve corresponds to the case of unknown covariance, where $M$ is set as in Equation (34). The figure confirms that $\delta_c$ scales roughly linearly in $\varepsilon$ (for small $\varepsilon$). In other words, in order for the debiased estimator to have an unbiased Gaussian distribution, the sample size $n$ only has to scale linearly in the support size $s_0$. (Note that for the circulant covariance chosen in this example, $s_\Omega = 2$.)
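The selection of the critical undersampling rate can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes that the kurtosis is measured in excess of the Gaussian value 3, that the standard error is the standard deviation across coordinates divided by $\sqrt{p}$, and that the rule picks the smallest $\delta$ on the grid whose mean kurtosis is within one standard error of zero. The function names are hypothetical.

```python
import math

def excess_kurtosis(samples):
    """Empirical excess kurtosis (0 for a Gaussian distribution)."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return m4 / m2 ** 2 - 3.0

def critical_delta(kurtosis_by_delta):
    """Smallest delta whose mean kurtosis is within one standard error of zero.

    kurtosis_by_delta maps each delta on the grid to the list of
    per-coordinate kurtosis values. Returns None if no delta qualifies.
    """
    for delta in sorted(kurtosis_by_delta):
        g = kurtosis_by_delta[delta]
        p = len(g)
        mean = sum(g) / p
        sd = math.sqrt(sum((x - mean) ** 2 for x in g) / (p - 1))
        if abs(mean) <= sd / math.sqrt(p):
            return delta
    return None
```

In an experiment, each list would hold the kurtosis of $T_i$ over the noise realizations for one value of $\delta$, computed with `excess_kurtosis`.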


4 Minimax lower bound on the residual R 


When the design covariance matrix is unknown, Theorem 3.13 establishes the following high-probability bound on the residual term $R$:

$$\|R\|_\infty \le C\rho\sigma\sqrt{\frac{s_0}{n}}\,\log p + C\sigma\min(s_0, s_\Omega)\frac{\log p}{\sqrt{n}}. \tag{40}$$

For sparse precision matrices, such that $s_\Omega \ll \sqrt{n}/(\log p)$, the residual term $\|R\|_\infty$ vanishes asymptotically under the nearly optimal condition $s_0 \ll n/(\log p)^2$. The question we will study in this section is whether such a condition on $s_\Omega$ is necessary. To answer this question, we develop a minimax lower bound on $\|R\|_\infty$. This also clarifies the connection between our results and the ones of [CG15], whose general approach we build on here.

Before presenting our results we need to introduce some notations and definitions. 

Consider the linear model (2) and define parameters of the form $\gamma = (\theta, \Omega, \sigma^2)$, which consists of the signal $\theta$, the precision matrix $\Omega = \Sigma^{-1}$ and the noise standard deviation $\sigma$.

For $\alpha \in (0,1)$ and a given parameter space $\Gamma$, denote by $\mathcal{I}_\alpha(\Gamma)$ the set of all $(1-\alpha)$-confidence intervals for $\theta_1$ over the entire space $\Gamma$,

$$\mathcal{I}_\alpha(\Gamma) = \Big\{ J_\alpha(y, X) : \inf_{\gamma\in\Gamma} P_\gamma\big(\theta_1 \in J_\alpha(y, X)\big) \ge 1 - \alpha \Big\}, \tag{41}$$

where $P_\gamma$ is the induced probability distribution on $(y, X)$ for random Gaussian design $X$ and noise realization $w$, given the fixed signal $\theta$. Here and below we focus on the first coordinate $\theta_1$ without loss of generality. For a given interval $J_\alpha(y, X) \in \mathcal{I}_\alpha(\Gamma)$, we let $\ell(J_\alpha(\cdot), \Gamma)$ be the maximum expected length over a parameter space $\Gamma$,

$$\ell(J_\alpha(\cdot), \Gamma) = \sup_{\gamma\in\Gamma} E_\gamma\{\ell(J_\alpha(y, X))\}, \tag{42}$$

with $E_\gamma$ the expectation with respect to $P_\gamma$. We further define the minimax rate for the expected length of confidence intervals over $\Gamma$ as follows:

$$\ell^*_\alpha(\Gamma) = \inf_{J_\alpha(\cdot)\in\mathcal{I}_\alpha(\Gamma)} \ell(J_\alpha(\cdot), \Gamma). \tag{43}$$

Figure 1: Empirical kurtosis of the (rescaled) debiased Lasso estimator $T_i = \sqrt{n}(\hat\theta^d_i - \theta^*_i)/(\sigma[M\hat\Sigma M^T]_{ii}^{1/2})$. We plot the kurtosis $m(\gamma^\delta)$ (over coordinates and 100 independent realizations), along with the upper and lower one-standard-error curves, as a function of the number of samples per parameter $\delta$. Here, $\varepsilon = 0.2$ and $\delta_c = 0.57$ is our empirical estimate for the number of samples above which the debiased estimator is approximately Gaussian.

Figure 2: Critical number of samples per coordinate $\delta_c$, versus fraction of non-zero coordinates $\varepsilon = s_0/p$. For $\delta > \delta_c(\varepsilon)$ the debiased Lasso estimator is empirically Gaussian distributed in our experiment. The approximately linear relationship at small $\varepsilon$ is in agreement with our theory.

We next define the parameter space $\Gamma(s_0, s_\Omega, \rho)$ as follows. Applying inequality (23), we relax Condition (iii) as $\|\Omega\|_\infty \le \rho$ and write

$$\Gamma(s_0, s_\Omega, \rho) = \Big\{ \gamma = (\theta, \Omega, \sigma^2) :\ \|\theta\|_0 \le s_0,\ \sigma \in (0, c],\ (\Omega^{-1})_{ii} \le 1,\ \frac{1}{C_{\max}} \le \sigma_{\min}(\Omega) \le \sigma_{\max}(\Omega) \le \frac{1}{C_{\min}},\ \|\Omega\|_\infty \le \rho,\ \max_{i\in[p]} |\{j \neq i,\ \Omega_{ij} \neq 0\}| \le s_\Omega \Big\}. \tag{44}$$

Quantities $c$, $C_{\min}$ and $C_{\max} > 1$ are constants which do not affect the minimax rate and therefore we have not made them explicit in our notation $\Gamma(s_0, s_\Omega, \rho)$.

Proposition 4.1. Consider a debiased estimator of form (5) with $M$ being a function of $X$ and $\hat\theta$ the Lasso estimator at regularization parameter $\lambda$. Further, let $R = \sqrt{n}(M\hat\Sigma - I)(\hat\theta - \theta^*)$ be the bias term and $Q = \mathrm{diag}(M\hat\Sigma M^T)$ be the variance term. Suppose that there exists a choice of $M$ and $\lambda$ such that

$$\lim_{n\to\infty} P\Big( \sup\big\{\|R\|_\infty : (\theta^*, \Omega, \sigma^2) \in \Gamma(s_0, s_\Omega, \rho)\big\} \le \Delta_n \Big) = 1, \tag{45}$$

$$\lim_{n\to\infty} P\Big( \sup\big\{\|Q\|_\infty : (\theta^*, \Omega, \sigma^2) \in \Gamma(s_0, s_\Omega, \rho)\big\} \le C \Big) = 1, \tag{46}$$

for some known $\Delta_n$ and for some known constant $C$. Then, we have

$$\ell^*_\alpha(\Gamma(s_0, s_\Omega, \rho)) \lesssim \frac{1 + \Delta_n}{\sqrt{n}}. \tag{47}$$

Note that since $Q$ is a function of only $X$, the arguments $\theta^*$ and $\sigma^2$ in Equation (46) are superfluous. To establish the above upper bound, we construct a confidence interval from a debiased estimator that belongs to $\mathcal{I}_\alpha(\Gamma(s_0, s_\Omega, \rho))$. We refer to Section I.1 for the proof of Proposition 4.1. The next proposition provides a lower bound on $\ell^*_\alpha(\Gamma(s_0, s_\Omega, \rho))$.

Proposition 4.2. Suppose that $\alpha \in (0, 1/2)$ and $s_0 \lesssim \min(p^{\nu}, n/\log p)$ for some constant $0 < \nu < 1/2$. Further, assume $\rho \ge 1.02$. The minimax expected length for $(1-\alpha)$-confidence intervals of $\theta_1$ over $\Gamma(s_0, s_\Omega, \rho)$ satisfies


$$\ell^*_\alpha(\Gamma(s_0, s_\Omega, \rho)) \gtrsim \frac{1}{\sqrt{n}} + \min\Big\{ s_0\frac{\log p}{n},\ s_\Omega\frac{\log p}{n} \Big\}. \tag{48}$$

Proposition 4.2 generalizes the result of [CG15, Theorem 2], which shows that without the sparsity constraint on $\Omega$ and the constraint $\|\Omega\|_\infty \le \rho$, the minimax rate for the expected confidence interval length is lower bounded by a constant multiple of $1/\sqrt{n} + s_0\log p/n$. Proposition 4.2 provides a more refined lower bound that takes into account the sparsity structure of the precision matrix. We refer to Section I.2 for its proof.

By comparing the upper and lower bounds on $\ell^*_\alpha(\Gamma(s_0, s_\Omega, \rho))$, we conclude that the condition $\min(s_0, s_\Omega)\log p \lesssim \sqrt{n}$ is necessary for having $\|R\|_\infty \le \Delta_n \to 0$. If this is not the case then

$$\Delta_n \gtrsim \min(s_0, s_\Omega)\log p/\sqrt{n}.$$

In particular, in order to get $\Delta_n = o(1)$ at the nearly optimal condition $s_0 \ll n/(\log p)^2$, we need the precision matrix to be sparse with $s_\Omega \lesssim \sqrt{n}/(\log p)$.


5 Other applications 

Our main results, Theorem 3.8 and Theorem 3.13, establish a Gaussian limit for the debiased Lasso estimator. While our main motivation was the construction of confidence intervals for single coordinates of the parameter vector, we want to emphasize that the Gaussian limit has other important applications. We illustrate this point using three examples: (i) We establish a characterization of the Lasso estimator in terms of a certain denoising problem. (ii) We develop a new thresholded Lasso estimator and provide a tight characterization of its $\ell_2$ risk. In the case of standard Gaussian designs this approach is minimax optimal up to a factor $1 + o_n(1)$. (iii) We prove that the celebrated Stein's Unbiased Risk Estimate of the prediction risk [Efr12] is consistent in high dimension, for standard Gaussian designs.


5.1 A probabilistic approximation result for the Lasso 

As a first consequence of our main theorem, we obtain a precise approximation result for the Lasso estimator. In order to state this result, let $\eta_\Sigma : \mathbb{R}^p \to \mathbb{R}^p$ be defined by

$$\eta_\Sigma(z) = \arg\min_{\theta\in\mathbb{R}^p} \Big\{ \frac{1}{2}\|\Sigma^{1/2}(z - \theta)\|_2^2 + \lambda\|\theta\|_1 \Big\}. \tag{49}$$

Note that the minimizer is always unique because $\Sigma$ is strictly positive definite. In the case $\Sigma = I$, $\eta_\Sigma$ coincides with component-wise soft thresholding at level $\lambda$. More generally, $\eta_\Sigma(\cdot)$ can be viewed as a denoising operator associated to the problem of estimating $\theta^*$ from the noisy observation $z = \theta^* + w$, where $w$ has covariance $\Sigma^{-1}$. Our next theorem connects the Lasso to this denoising problem.

Theorem 5.1. Consider the linear model (2) where X has independent Gaussian rows, with zero 
mean and covariance S, satisfying the assumptions of Theorem 3.8. Further assume the following 
condition: 


{iv) Letting C* = 32C'max/C'min; we assume ||oo < P for some constant p and all T C [p\, 

|r| < 20*50. 

Let $\hat\theta^{\rm Lasso} = \hat\theta^{\rm Lasso}(y, X; \lambda)$ be the Lasso estimator with $\lambda = \kappa\sigma\sqrt{(\log p)/n}$, for $\kappa \in [8, \kappa_{\max}]$. Then,

there exist constants c, C (depending on Cmin, O ma x. p, p, Kmaxj, such that for n > max(251ogp, csq log(p/so)), 
the following holds true with high probability. 


^Lasso -9.X'w 

n 


2 V n / 


(50)

Under the hypothesis of this theorem, the Lasso $\ell_2$ error is known to be bounded as $\|\hat\theta^{\rm Lasso} - \theta^*\|_2^2 \le C(s_0\log p)/n$ [BRT09]. Hence, Theorem 5.1 provides a characterization of the Lasso estimator that is one order of magnitude more accurate than what is available in the literature.

This characterization is particularly convenient if the population covariance has a simple structure. For instance, we obtain the following immediate corollary that characterizes the $\ell_2$ error for standard designs.

Corollary 5.2. Consider the linear model (2) where $X$ has independent Gaussian rows, with zero mean and covariance $\Sigma = I$. Let $\hat\theta^{\rm Lasso} = \hat\theta^{\rm Lasso}(y, X; \lambda)$ be the Lasso estimator with $\lambda = \kappa\sigma\sqrt{(\log p)/n}$, for a constant $\kappa \ge 8$. Then, for $n \ge \max(25\log p,\ c\,s_0\log(p/s_0))$ we have

||0L—_r||2= ^ ¥.z{[0{e* + Zi-,X) - 0*]^] + Op ^^2 \/go^ogP ^ 

iesupp(0*) ^ 

(51) 


where expectation is taken with respect to $Z_i \sim N(0,1)$, and the $O_P(\cdot)$ is uniform for $\kappa \in [8, \kappa_{\max}]$.

Let us emphasize that this is not an upper bound, but an equality up to higher order terms. It provides a connection between the Lasso mean square error and the mean square error of soft-thresholding denoising in the classical sequence model. A similar connection was anticipated, for instance, in [DMM11, DJM13]. An asymptotic characterization of the Lasso mean square error for standard Gaussian designs was first obtained in [BM12]. However, in the present case we recover this as a corollary of a result for general Gaussian designs, and in a non-asymptotic form.


5.2 Minimax optimal estimation 

The analysis in the last section suggests that it is possible to reduce the estimation error through a 
two step procedure. For the sake of simplicity, we shall assume here that S is known. Our approach 
can be extended to imperfectly known covariance by using Theorem 3.13, but we leave this for future 
work. The suggested procedure is: 

(i) Compute the Lasso estimator $\hat\theta^{\rm Lasso} = \hat\theta^{\rm Lasso}(y, X; \lambda)$ with $\lambda = \kappa\sigma\sqrt{(\log p)/n}$.

(ii) Compute the debiased estimator $\hat\theta^d = \hat\theta^{\rm Lasso} + n^{-1}\Omega X^T(y - X\hat\theta^{\rm Lasso})$.

(iii) Compute a new estimator by soft thresholding $\hat\theta^d$ component-wise, namely

$$\hat\theta^{(2)}_i = \eta(\hat\theta^d_i; \tau_i), \qquad \tau_i = \sqrt{\frac{2\sigma^2\Omega_{ii}\log(p/s_0)}{n}}. \tag{52}$$

Here $\eta(x; \tau) = (|x| - \tau)_+\,\mathrm{sign}(x)$ is the scalar soft-thresholding function.

Let us emphasize that in the last step we soft-threshold at a level that is smaller than the regularization used in the Lasso. Indeed, since $\Omega_{ii} \le C_{\min}^{-1}$, we have $\tau_i = O(\sqrt{\log(p/s_0)/n})$, while $\lambda$ is of order $\sqrt{(\log p)/n}$.

Theorem 5.3. Consider the linear model (2) where $X$ has independent Gaussian rows, with zero mean and covariance $\Sigma$, satisfying the assumptions of Theorem 3.8. Further assume $s_0 \to \infty$, $s_0/p \to 0$ and $(s_0(\log p)^2)/n \to 0$. Let $\hat\theta^{(2)}$ be the two-step estimator defined above. Then

$$\|\hat\theta^{(2)} - \theta^*\|_2^2 \le \frac{2\sigma^2}{n}\log(p/s_0)\Big(\sum_{i\in\mathrm{supp}(\theta^*)} \Omega_{ii}\Big)\big(1 + o_P(1)\big). \tag{53}$$

Note that, in the case $\Sigma = I$, the right-hand side of (53) is the minimax optimal risk, up to a factor going to one as $n, s_0, p \to \infty$ [SC15]. Candès and Su [SC15] recently proved that SLOPE achieves the same guarantee for Gaussian designs with $\Sigma = I$. On one hand, the approach of [SC15] has the advantage of being adaptive to the unknown sparsity level $s_0$. On the other, Theorem 5.3 establishes this result as a special case of a guarantee holding for more general Gaussian designs.
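As an illustration of step (iii) of the two-step procedure, the sketch below soft-thresholds a debiased estimate coordinate by coordinate at the level $\tau_i = \sigma\sqrt{2\Omega_{ii}\log(p/s_0)/n}$. The function names are hypothetical, and the debiased vector, the diagonal of $\Omega$, the noise level and the sparsity $s_0$ are all assumed to be available to the caller.

```python
import math

def soft_threshold(x, tau):
    """Scalar soft thresholding eta(x; tau) = (|x| - tau)_+ * sign(x)."""
    return math.copysign(max(abs(x) - tau, 0.0), x)

def threshold_debiased(theta_d, omega_diag, sigma, n, p, s0):
    """Soft-threshold each debiased coordinate at
    tau_i = sigma * sqrt(2 * Omega_ii * log(p / s0) / n)."""
    out = []
    for x, w in zip(theta_d, omega_diag):
        tau = sigma * math.sqrt(2.0 * w * math.log(p / s0) / n)
        out.append(soft_threshold(x, tau))
    return out
```

Coordinates whose debiased value falls below $\tau_i$ in absolute value are set to zero, and the remaining ones are shrunk towards zero by $\tau_i$.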

5.3 SURE estimate of the prediction error 

Define the Lasso prediction error as

$$R(y, X, \theta^*) = \frac{1}{n}\|X(\hat\theta^{\rm Lasso} - \theta^*)\|_2^2 + \frac{1}{n}\|w\|_2^2. \tag{54}$$

Notice that the first term is the standard prediction error, for a given design matrix $X$. The second term is the residual error that would be present even for the perfect estimator $\hat\theta = \theta^*$. We include this contribution for mathematical convenience, but it is just a constant, independent of the estimator. The naive empirical estimate for the prediction error is

$$\hat R(y, X) = \frac{1}{n}\|y - X\hat\theta^{\rm Lasso}\|_2^2. \tag{55}$$

Of course we expect the empirical risk to under-estimate the actual risk. Stein's Unbiased Risk Estimate (SURE) provides a corrected estimate

$$R_{\rm SURE}(y, X) = \frac{1}{n}\|y - X\hat\theta^{\rm Lasso}\|_2^2 + \frac{2\sigma^2}{n}\|\hat\theta^{\rm Lasso}\|_0. \tag{56}$$
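A direct way to evaluate the SURE formula, residual sum of squares per sample plus $2\sigma^2/n$ times the number of non-zero coefficients, is sketched below. The function name is hypothetical, the inputs are plain Python lists, and the noise level `sigma` is assumed known (or replaced by an estimate).

```python
def sure_prediction_risk(y, X, theta_hat, sigma):
    """SURE estimate: ||y - X theta_hat||_2^2 / n + (2 sigma^2 / n) * ||theta_hat||_0.

    y is a list of n responses, X a list of n rows, theta_hat a list of p coefficients.
    """
    n = len(y)
    rss = 0.0
    for row, y_k in zip(X, y):
        fit = sum(x_kj * t_j for x_kj, t_j in zip(row, theta_hat))
        rss += (y_k - fit) ** 2
    # number of non-zero coefficients, i.e. the degrees of freedom of the Lasso fit
    support_size = sum(1 for t in theta_hat if t != 0.0)
    return rss / n + 2.0 * sigma ** 2 * support_size / n
```

The first term alone is the naive empirical risk; the second term is the correction that grows with the size of the selected support.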

This approach has a rich history for which we can only provide a few pointers. Donoho and Johnstone used SURE to develop an adaptive denoising procedure via wavelet thresholding. From the perspective of linear regression, this corresponds to $X$ being proportional to an orthogonal matrix. Efron [Efr12] developed a general formula for estimating the prediction error, based on Stein's ideas, and clarified the connection with classical model selection criteria such as Akaike's information criterion [Aka74] and Mallows' $C_p$ [Mal73]. Zou, Hastie and Tibshirani [ZHT+07] showed that the number of degrees of freedom (which enters Efron's formula) coincides with the number of non-zero parameters $\|\hat\theta^{\rm Lasso}\|_0$. They also proved that $R_{\rm SURE}(y, X)$ is consistent in the classical low-dimensional regime $n \to \infty$ with $p$ fixed.

To the best of our knowledge, this is the first case in which $R_{\rm SURE}(y, X)$ is proved to be consistent in high dimension (although in a restricted setting, namely for Gaussian designs).

Theorem 5.4. Consider the linear model (2) where X has independent Gaussian rows, with zero 
mean and identity covariance S = 1. Let 0^“=° = X; A) be the Lasso estimator with X > 
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9a^{logp)/n. If n,p —)■ oo with sq = o(n/(logp)^), then there exists —>• 0 as n —)■ oo, such that 

the following holds with probability at least 1 — e~^^ — o„(l); 

|RsuRE(y,X)-R(?/,x,r)| + (57) 

' ' y/n n 

Let us emphasize a few important points: 

• The error bound in Eq. (57) is of smaller order than the correction in (56), which typically is of order $s_0\sigma^2/n$.

• The SURE risk estimate $R_{\rm SURE}(y, X)$ is perfectly well defined for arbitrary design covariance $\Sigma$.

• While our proof applies to standard designs, $\Sigma = I$, we expect the conclusion of Theorem 5.4 to hold more generally. This is also confirmed by the simulations discussed below.

In Figure 3, we present the results of a numerical simulation with $p = 5000$, $n = 1800$. We choose a subset $S \subseteq [p]$ of size $s_0 = |S| = 100$ uniformly at random and set $\theta_{0,i} = 0.1$ if $i \in S$ and $\theta_{0,i} = 0$ otherwise. The design matrix $X$ has i.i.d. random rows $x_i \sim N(0, \Sigma)$ with $\Sigma_{ij} = r^{|i-j|}$. We set $r = 0.1$ to illustrate a case of low correlation between predictors and $r = 0.9$ for a case of high correlation. In our simulations, we replace the noise level $\sigma$ appearing in Eq. (56) with an estimate $\hat\sigma$, obtained as follows. We first run the scaled Lasso and then perform least squares after model selection to mitigate the estimation bias. More precisely, we use the R package scalreg with the default value for the regularization parameter in the scaled Lasso cost function. This selects a model $\hat S$. We then perform least squares on $\hat S$ to obtain an estimate $\hat\theta^{\rm LS}$. The noise level is computed as $\hat\sigma = \|y - X\hat\theta^{\rm LS}\|_2/\sqrt{n}$. The agreement between $R_{\rm SURE}(y, X)$ and $R(y, X, \theta^*)$ is excellent.

Let us mention that [BEM13] also studied estimators similar to $R_{\rm SURE}(y, X)$, and related ideas were developed in [OK15] on the basis of non-rigorous but insightful statistical mechanics techniques. Other approaches to risk estimation, e.g. [CG16], are based on sample-splitting, which has complementary shortcomings.

6 Proof of Theorem 3.8 (known covariance) 

6.1 Outline of the proof 

Fix an arbitrary integer $i \in [p]$. In our analysis, we focus on the $i$-th coordinate $\theta^*_i$, and then discuss how the argument can be adjusted to apply to all the coordinates simultaneously. Our argument relies on a perturbation analysis. We let $\hat\theta^p$ be the Lasso estimator when one forces $\hat\theta^p_i = \theta^*_i$. With a slight abuse of notation, we use the representation $\theta = (\theta_i, \theta_{\sim i})$.¹ Adopting this convention, we have $\hat\theta^p = (\theta^*_i, \hat\theta^p_{\sim i})$ where

$$\hat\theta^p_{\sim i} = \arg\min_{\theta} \mathcal{L}_{y,X}(\theta^*_i, \theta). \tag{58}$$

Throughout, we make the convention that $\mathcal{L}_{y,X}(\theta_i, \theta_{\sim i}) \equiv \mathcal{L}_{y,X}((\theta_i, \theta_{\sim i}))$.

¹Or, without loss of generality, one can assume $i = 1$.


(a) $r = 0.1$ (b) $r = 0.9$

Figure 3: Lasso prediction error $R(y, X, \theta^*)$, empirical prediction error $\hat R(y, X)$, and SURE estimator $R_{\rm SURE}$ curves versus $\lambda$ for the simulation setting described in Section 5.3.


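We observe that $\hat\theta^p_{\sim i}$ can be written as a Lasso estimator. Specifically, by definition of the Lasso cost function we have

$$\mathcal{L}_{y,X}(\theta^*_i, \theta) = \frac{1}{2n}\|y - \tilde x_i\theta^*_i - X_{\sim i}\theta\|_2^2 + \lambda|\theta^*_i| + \lambda\|\theta\|_1.$$

Letting $\tilde y = y - \tilde x_i\theta^*_i = w + X_{\sim i}\theta^*_{\sim i}$, we obtain

$$\hat\theta^p_{\sim i} = \arg\min_{\theta} \mathcal{L}_{\tilde y, X_{\sim i}}(\theta). \tag{59}$$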


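Let $v_i = X\Omega e_i$ and expand $\hat\theta^d_i - \theta^*_i$ as follows:

$$\begin{aligned}
\sqrt{n}(\hat\theta^d_i - \theta^*_i) &= \sqrt{n}\,\hat\theta_i + \frac{1}{\sqrt{n}} e_i^T\Omega X^T(y - X\hat\theta) - \sqrt{n}\,\theta^*_i \\
&= \sqrt{n}\,\hat\theta_i + \frac{1}{\sqrt{n}} v_i^T\big[w + \tilde x_i(\theta^*_i - \hat\theta_i) + X_{\sim i}(\theta^*_{\sim i} - \hat\theta_{\sim i})\big] - \sqrt{n}\,\theta^*_i \\
&= \sqrt{n}\Big(1 - \frac{1}{n}\langle v_i, \tilde x_i\rangle\Big)(\hat\theta_i - \theta^*_i) + \frac{1}{\sqrt{n}} v_i^T\big[w + X_{\sim i}(\theta^*_{\sim i} - \hat\theta_{\sim i})\big].
\end{aligned} \tag{60}$$

We decompose the above expression into the following terms:

$$\begin{aligned}
Z_i &= \frac{1}{\sqrt{n}} v_i^T w, \\
R_i^{(1)} &= \sqrt{n}\Big(1 - \frac{\langle v_i, \tilde x_i\rangle}{n}\Big)(\hat\theta_i - \theta^*_i), \\
R_i^{(2)} &= \frac{1}{\sqrt{n}} v_i^T X_{\sim i}(\theta^*_{\sim i} - \hat\theta^p_{\sim i}), \\
R_i^{(3)} &= \frac{1}{\sqrt{n}} v_i^T X_{\sim i}(\hat\theta^p_{\sim i} - \hat\theta_{\sim i}).
\end{aligned} \tag{61}$$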


The bulk of the proof consists in treating each of the terms above separately. Term $Z_i$ gives the Gaussian component $Z$ in equation (25). For bounding $R_i^{(2)}$, note that $\hat\theta^p_{\sim i}$ is a deterministic function of $(\tilde y, X_{\sim i})$ (and thus a deterministic function of $(w, X_{\sim i})$) by Equation (59). Further, $v_i$ is independent of $X_{\sim i}$, as per Lemma 3.6, and independent of the noise $w$. Hence, $v_i$ is independent of $X_{\sim i}(\theta^*_{\sim i} - \hat\theta^p_{\sim i})$. Bounding $R_i^{(3)}$ relies on a perturbation analysis showing that the solution of the Lasso, $\hat\theta$, and its perturbed form $\hat\theta^p$ are close to each other.


6.2 Technical steps 

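Let $Z = (Z_i)_{1\le i\le p}$. We rewrite $Z$ as

$$Z = \frac{1}{\sqrt{n}}\Omega X^T w.$$

Since $w \sim N(0, \sigma^2 I)$ is independent of $X$, we get

$$Z|X \sim N(0, \sigma^2\Omega\hat\Sigma\Omega).$$

Let $R^{(1)} = (R_i^{(1)})_{i=1}^p$, $R^{(2)} = (R_i^{(2)})_{i=1}^p$, $R^{(3)} = (R_i^{(3)})_{i=1}^p$. In the following, we provide a detailed analysis to control the terms $R^{(1)}$, $R^{(2)}$, $R^{(3)}$.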

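• Bounding term $R^{(1)}$: Recalling the definition $v_i = X\Omega e_i$, we write

$$R_i^{(1)} = \sqrt{n}\Big(1 - \frac{1}{n} e_i^T\Omega X^T X e_i\Big)(\hat\theta_i - \theta^*_i).$$

Therefore,

$$\|R^{(1)}\|_\infty \le \sqrt{n}\,|I - \Omega\hat\Sigma|_\infty\,\|\hat\theta - \theta^*\|_2.$$

For $A > 0$, let $\mathcal{G}_n = \mathcal{G}_n(A)$ be the event that

$$\mathcal{G}_n(A) = \Big\{ X \in \mathbb{R}^{n\times p} : |\Omega\hat\Sigma - I|_\infty \le A\sqrt{\frac{\log p}{n}} \Big\}. \tag{62}$$

Using the result of [JM14a, Lemma 6.2], for $n \ge (A^2 C_{\min})/(4e^2 C_{\max})\log p$ we have

$$P(X \in \mathcal{G}_n(A)) \ge 1 - 2p^{-c}, \qquad c = \frac{A^2 C_{\min}}{24 e^2 C_{\max}} - 2. \tag{63}$$

By choosing $A = 10e\sqrt{C_{\max}/C_{\min}}$ we get $c > 2$. Therefore, provided that $n \ge 25\log p$,

$$P(X \in \mathcal{G}_n(A)) \ge 1 - 2p^{-2}.$$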

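In addition, on the event $\mathcal{B} = \mathcal{B}_\delta(n, s_0, 3) \cap \mathcal{B}(n, p)$ we have [BvdG11]

$$\|\hat\theta - \theta^*\|_2 \le \frac{5\lambda\sqrt{s_0}}{(1-\delta)^2 C_{\min}}.$$

Combining the above bounds, we obtain that on the event $\mathcal{G}_n(A) \cap \mathcal{B}$,

$$\|R^{(1)}\|_\infty \le \frac{5\kappa A\sigma}{(1-\delta)^2 C_{\min}}\sqrt{\frac{s_0}{n}}\,\log p. \tag{64}$$

• Bounding term $R^{(2)}$: To lighten the notation, we define

$$\zeta_i = \frac{1}{\sqrt{n}} X_{\sim i}(\theta^*_{\sim i} - \hat\theta^p_{\sim i}). \tag{65}$$

As discussed, $\hat\theta^p_{\sim i}$ is a Lasso estimator with design matrix $X_{\sim i}$ and response vector $\tilde y = y - \tilde x_i\theta^*_i$, as per equation (59). We recall the following result on the prediction error of the Lasso estimator, which bounds $\|\zeta_i\|_2$.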

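Proposition 6.1 ([BvdG11], Theorem 6.1). Let $S = \mathrm{supp}(\theta^*_{\sim i})$. Then on the event $\mathcal{B}(n, p)$, we have for $\lambda \ge 8\sigma\sqrt{(\log p)/n}$,

$$\|\zeta_i\|_2^2 \le \frac{4\lambda^2 s_0}{\phi^2(\hat\Sigma_{\sim i,\sim i}, S)}.$$

From the definition of the compatibility constant (cf. Definition 3.1), it is clear that $\phi^2(\hat\Sigma_{\sim i,\sim i}, S) \ge \phi^2(\hat\Sigma, S)$. Therefore, combining Proposition 6.1 and Remark 3.3, we arrive at the following corollary:

Corollary 6.2. On the event $\mathcal{B} = \mathcal{B}_\delta(n, s_0, 3) \cap \mathcal{B}(n, p)$, we have for $\lambda \ge 8\sigma\sqrt{(\log p)/n}$,

$$\|\zeta_i\|_2^2 \le \frac{4\lambda^2 s_0}{(1-\delta)^2 C_{\min}}.$$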

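Employing Corollary 6.2, we derive a tail bound on $\|R^{(2)}\|_\infty$. For $i \in [p]$ define the event

$$\mathcal{E}_i = \Big\{ \|\zeta_i\|_2^2 \le \frac{4\lambda^2 s_0}{(1-\delta)^2 C_{\min}} \Big\}. \tag{66}$$

By Corollary 6.2, we have $\mathcal{B} \subseteq \mathcal{E}_i$ for $i \in [p]$. Hence, for any value $t > 0$

$$\begin{aligned}
P\big(\|R^{(2)}\|_\infty \ge t;\ \mathcal{B}\big) &\le P\Big(\max_{i\in[p]} |v_i^T\zeta_i| \ge t;\ \cap_{i\in[p]}\mathcal{E}_i\Big) \\
&\le p\max_{i\in[p]} E\big\{ \mathbb{I}(|v_i^T\zeta_i| \ge t)\cdot\mathbb{I}(\mathcal{E}_i) \big\} \\
&\le 2p\max_{i\in[p]} E\Big\{ \exp\Big(-\frac{t^2}{2\Omega_{ii}\|\zeta_i\|_2^2}\Big)\,\mathbb{I}(\mathcal{E}_i) \Big\} \\
&\le 2p\exp\Big(-\frac{c_* C_{\min} t^2}{\lambda^2 s_0}\Big),
\end{aligned}$$

with $c_* = (1-\delta)^2 C_{\min}/8$. In the third inequality, we applied Fubini's theorem, first integrating with respect to $v_i$ and then with respect to $\zeta_i$, using the fact that $v_i$ and $\zeta_i$ are independent. Note that $v_i \sim N(0, \Omega_{ii} I_{n\times n})$ and thus, conditionally on $\zeta_i$, $v_i^T\zeta_i \sim N(0, \Omega_{ii}\|\zeta_i\|_2^2)$. Further, on the event $\mathcal{E}_i$, $\|\zeta_i\|_2^2$ can be bounded as in Equation (66), and $\Omega_{ii} \le C_{\min}^{-1}$.

Setting $t = \kappa\sigma\sqrt{2s_0/(c_* C_{\min} n)}\,\log p$, we get

$$P\Big( \|R^{(2)}\|_\infty \ge \kappa\sigma\sqrt{\frac{2s_0}{c_* C_{\min} n}}\,\log p;\ \mathcal{B} \Big) \le 2p^{-1}. \tag{67}$$

• Bounding term $R^{(3)}$: In order to bound the last term, we first need to establish the following main lemma that bounds the distance between the Lasso estimator and the solution of the perturbed problem. We refer to Section 6.3 for the proof of Lemma 6.3.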

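Lemma 6.3 (Perturbation bound). Suppose that $\Sigma_{ii} \le 1$, for $i \in [p]$. Set $\lambda = 8\sigma\sqrt{(\log p)/n}$ and let $\mathcal{B}(C_s) = \mathcal{B}(n, p) \cap \mathcal{B}_\delta(n, C_s s_0, 3)$. The following holds true.

$$P\Big( \|\hat\theta_{\sim i} - \hat\theta^p_{\sim i}\|_2 \ge C'\lambda;\ \mathcal{B}(C_s) \Big) \le 2\exp\Big(-\frac{c_* n}{s_0}\Big) + \exp\Big(-\frac{n}{1000}\Big), \tag{68}$$

where

$$C' = \frac{24\rho(1+\delta)\sqrt{C_{\max}}}{(1-\delta)^2 C_{\min}}, \qquad c_* = \frac{(1-\delta)^2 C_{\min}}{8}, \qquad C_s = \frac{16 C_{\max}}{(1-\delta)^2 C_{\min}} + 1.$$

We are now ready to bound term $R^{(3)}$.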


< -^\\v]x^iW^pl^-e^iwi 

Vn 


< 


^5'SOii I 

Vi 


n 


2 


< ^Csson\nt-iu\el,-t 


■i\\2 : 


where in the first inequality we used Lemma 3.5, which implies that — 0^j||o < 6/(550) under B. 
Therefore, by Lemma 6.3 and equation (63) and since B{Cs) C B, we have 


> C'aJ — 


n 


logp-,B{Cs)) 




< 2 exp (-) + exp ( — 


n 


1000 


+ 2p 


-2 


with C" = (67* + 1)^C". Hence, by union bound over the p coordinates, we get 


OO — 


C''a logp]B{Cs)J < 2pexp 


-) + p exp ( - 

So 


n 


1000 


+ 2p 


-1 


(69) 


We are now in position to prove the claim of Theorem 3.8. 
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Using equations (60) and (61), we have ^/n{6^ — 9*) = Z + R, where Z\X ~ N(0,and 
R = R^^'^ + + R^^\ Combining equations (64), (67) and (69), we get 


oo>C^logp-gn{A)nl3{Cs)) <2pi 


c*n 


< 2p exp (-) + p exp ( — 


n 


1000 


+ 4p 


-1 


where C is given by 


C = Ka 


5A 


(1 - 6)^Cn 


+ 




+ 


Further, for n > max(25 logp, ciC^sq log(p/so)), we have 


gn{A) n BiCsW] < nOniAr) + F0{n,pr) + F{Bs{n, Cssq, 3)^ 


< 2p-^ + + 2e"'^"’" = 4p"^ + 

where we used bound (63), Lemma 3.2 and Lemma 3.4. 

The result follows from equations (70) and (72), and setting <5 = 1 — \l\f2. 

6.3 Proof of Lemma 6.3 (perturbation bound) 

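Lemma 6.4. For all $\theta \in \mathbb{R}^{p-1}$ the following holds true.

$$\frac{1}{2n}\|X_{\sim i}(\theta - \hat\theta^p_{\sim i})\|_2^2 \le \mathcal{L}_{y,X}(\theta^*_i, \theta) - \mathcal{L}_{y,X}(\theta^*_i, \hat\theta^p_{\sim i}).$$

Lemma 6.4 is proved in Appendix C.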


(70) 


(71) 


(72) 


(73) 


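Lemma 6.5. Let $f_k(x) = \frac{c}{2}(x - a - u_k)^2 + \lambda|x| + b_k$ for $k = 1, 2$. Further assume that $\min_x f_1(x) \le \min_x f_2(x)$. Then,

$$f_1(a) - f_2(a) \le (c|u_2| + \lambda)|u_1 - u_2| + \frac{c}{2}(u_1 - u_2)^2. \tag{74}$$

Lemma 6.5 is proved in Appendix D.

Lemma 6.6. For $\theta \in \mathbb{R}^{p-1}$ define

$$u(\theta) = \frac{\tilde x_i^T\big(w + X_{\sim i}(\theta^*_{\sim i} - \theta)\big)}{\|\tilde x_i\|_2^2}. \tag{75}$$

Also let $c_i = \|\tilde x_i\|_2^2/n$. Then, the following relation holds true.

$$\mathcal{L}(\theta_i, \theta) = \lambda|\theta_i| + \frac{c_i}{2}\big(\theta_i - \theta^*_i - u(\theta)\big)^2 - \frac{c_i}{2}u(\theta)^2 + \mathcal{L}(\theta^*_i, \theta) - \lambda|\theta^*_i|. \tag{76}$$

Lemma 6.6 is proved in Appendix E.

We let $f_1(x) = \mathcal{L}(x, \hat\theta_{\sim i})$ and $f_2(x) = \mathcal{L}(x, \hat\theta^p_{\sim i})$. Note that $(\hat\theta_i, \hat\theta_{\sim i})$ is the minimizer of $\mathcal{L}(\theta_i, \theta_{\sim i})$. Therefore, $\min_x f_1(x) = \mathcal{L}(\hat\theta_i, \hat\theta_{\sim i}) \le \min_x f_2(x)$. Using decomposition (76) and applying Lemma 6.5 with

$$c = c_i, \quad a = \theta^*_i, \quad u_1 = u(\hat\theta_{\sim i}), \quad u_2 = u(\hat\theta^p_{\sim i}), \tag{77}$$

$$b_1 = -\frac{c_i}{2}u(\hat\theta_{\sim i})^2 + \mathcal{L}(\theta^*_i, \hat\theta_{\sim i}) - \lambda|\theta^*_i|, \tag{78}$$

$$b_2 = -\frac{c_i}{2}u(\hat\theta^p_{\sim i})^2 + \mathcal{L}(\theta^*_i, \hat\theta^p_{\sim i}) - \lambda|\theta^*_i|, \tag{79}$$

we obtain

$$\mathcal{L}(\theta^*_i, \hat\theta_{\sim i}) - \mathcal{L}(\theta^*_i, \hat\theta^p_{\sim i}) \le \big(c_i|u(\hat\theta^p_{\sim i})| + \lambda\big)\,|u(\hat\theta_{\sim i}) - u(\hat\theta^p_{\sim i})| + \frac{c_i}{2}\big(u(\hat\theta_{\sim i}) - u(\hat\theta^p_{\sim i})\big)^2. \tag{80}$$

We next write

$$\frac{c_i}{2}\big(u(\hat\theta_{\sim i}) - u(\hat\theta^p_{\sim i})\big)^2 = \frac{1}{2n\|\tilde x_i\|_2^2}(\hat\theta_{\sim i} - \hat\theta^p_{\sim i})^T X_{\sim i}^T\tilde x_i\tilde x_i^T X_{\sim i}(\hat\theta_{\sim i} - \hat\theta^p_{\sim i}) = \frac{1}{2n}\|P_{\tilde x_i}X_{\sim i}(\hat\theta_{\sim i} - \hat\theta^p_{\sim i})\|_2^2, \tag{81}$$

where $P_{\tilde x_i} = \tilde x_i\tilde x_i^T/\|\tilde x_i\|_2^2$ denotes the projection on the direction of $\tilde x_i$.

We lower bound the left-hand side of Equation (80) using Lemma 6.4 and employ the above identity to get, with $P^\perp_{\tilde x_i} = I - P_{\tilde x_i}$,

$$\frac{1}{2n}\|P^\perp_{\tilde x_i}X_{\sim i}(\hat\theta_{\sim i} - \hat\theta^p_{\sim i})\|_2^2 \le \big(c_i|u(\hat\theta^p_{\sim i})| + \lambda\big)\,|u(\hat\theta_{\sim i}) - u(\hat\theta^p_{\sim i})|. \tag{82}$$

The next proposition bounds $c_i|u(\hat\theta^p_{\sim i})|$. We defer the proof of Proposition 6.7 to Appendix F.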

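Proposition 6.7. Let $\mathcal{B} = \mathcal{B}(n, p) \cap \mathcal{B}_\delta(n, s_0, 3)$, where the events $\mathcal{B}_\delta(n, s_0, 3)$ and $\mathcal{B}(n, p)$ are given as per equations (16) and (18). The following holds true.

$$P\big( |c_i u(\hat\theta^p_{\sim i})| \ge 1.25\rho\lambda;\ \mathcal{B} \big) \le 2\exp\Big(-\frac{c_* n}{s_0}\Big),$$

where $c_* = (1-\delta)^2 C_{\min}/8$.

We further have

$$|u(\hat\theta_{\sim i}) - u(\hat\theta^p_{\sim i})| = \frac{|\tilde x_i^T X_{\sim i}(\hat\theta^p_{\sim i} - \hat\theta_{\sim i})|}{\|\tilde x_i\|_2^2} \le \frac{\|X_{\sim i}(\hat\theta^p_{\sim i} - \hat\theta_{\sim i})\|_2}{\|\tilde x_i\|_2}. \tag{83}$$

We next upper bound the term $\|X_{\sim i}(\hat\theta_{\sim i} - \hat\theta^p_{\sim i})\|_2$.

The corollary below follows from Proposition 3.5 and its proof is given in Appendix G.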

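Corollary 6.8. Set $\lambda = \kappa\sigma\sqrt{(\log p)/n}$ for a constant $\kappa \ge 8$. On the event $\mathcal{B}(C_s) = \mathcal{B}(n, p) \cap \mathcal{B}_\delta(n, (C_* + 1)s_0, 3)$, the following holds.

$$\frac{1}{n}\|X_{\sim i}(\hat\theta_{\sim i} - \hat\theta^p_{\sim i})\|_2^2 \le (1+\delta)^2 C_{\max}\|\hat\theta_{\sim i} - \hat\theta^p_{\sim i}\|_2^2. \tag{84}$$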

We next lower bound ||ii|| 2 - Observe that the entries — 1, I G [n], are zero-mean sub¬ 
exponential random variables. We obtain the following tail-bound inequality by applying Bernstein- 
type inequality for sub-exponential random variables. (See e.g. [JM14b, Equation (190)].) 


XiW > 




(85) 


Combining the results of Proposition (6.7) and equations (84) and (85), we obtain that on event 
B, with probability at least 1 — e-"-/iooo _ 2 g-c*n/so^ following holds: 


- c,)||i < I2p(i + 5)v^A||Li - ez,\ 

The last step is to lower bound the left-hand side of Equation (86). Write 

- 0P,) = - CJ - P^W^,(Li - CJ 


( 86 ) 


= x^i(e^i -oZi)- X 


^ \ II ~ ||2 

^iW 




Define vector fi €MP with 


IJ-i — ^ II ~~ ||2 ) ) fir-.i — 9r^i . 

' 11^2 II ' 

Then // G C(C'5S0)3), by Proposition 3.5, with = C* -|- 1 . Hence, on the event B{n,Csso,3), we 
have 

>^(l-5)2c^in||0^.-C,f. (87) 

Finally, note that C(so,3) C C(C,5So,3), since Cs > 1. Therefore, Bs{n,Csso,3) C Bs{n, so,3), by 
definition. Letting B{Cs) = B{n,p) n B{n,Csso,3), we have B{Cs) C B. Combining equations (86) 
and (87), we obtain 


P 



0? 


ill ^ 


24p(l -|- 5) VCmax 

(1 - 


X-,B{Cs)) 


< 2 exp 


c*n 


So 


-I- exp 


This completes the proof. 


n \ 
1000 / ■ 


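(88)

7 Proof of Theorem 3.13 (unknown covariance)

We decompose $\sqrt{n}(\hat\theta^d - \theta^*)$ into three terms:

$$\begin{aligned}
\sqrt{n}(\hat\theta^d - \theta^*) &= \sqrt{n}(\hat\theta - \theta^*) + \frac{1}{\sqrt{n}}MX^T(y - X\hat\theta) \\
&= \sqrt{n}(I - M\hat\Sigma)(\hat\theta - \theta^*) + \frac{1}{\sqrt{n}}MX^T w \\
&= \underbrace{\sqrt{n}(I - \Omega\hat\Sigma)(\hat\theta - \theta^*)}_{I_1} + \underbrace{\sqrt{n}(\Omega - M)\hat\Sigma(\hat\theta - \theta^*)}_{I_2} + \underbrace{\frac{1}{\sqrt{n}}MX^T w}_{I_3}.
\end{aligned}$$

Note that the term $I_1$ is exactly the bias vector $R$ of the debiased estimator in the case of known covariance (with $M = \Omega$). Therefore, by invoking the result of Theorem 3.8, we have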

P(||h||oo > C^^logp) < + 8p-i + . (89) 
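The three-term decomposition at the start of this section is an algebraic identity that holds for any estimate $\hat\theta$ and any matrix $M$, which makes it easy to check numerically. The NumPy sketch below does so; the function names are hypothetical and the inputs in any experiment (design, noise, $M$, $\Omega$) are placeholders chosen by the user, not the paper's simulation setup.

```python
import numpy as np

def debiased_estimator(theta_hat, M, X, y):
    """theta_d = theta_hat + (1/n) M X^T (y - X theta_hat)."""
    n = X.shape[0]
    return theta_hat + M @ (X.T @ (y - X @ theta_hat)) / n

def decomposition_terms(theta_hat, theta_star, M, Omega, X, w):
    """Return (I1, I2, I3) such that sqrt(n) * (theta_d - theta_star) = I1 + I2 + I3."""
    n, p = X.shape
    Sigma_hat = X.T @ X / n
    err = theta_hat - theta_star
    I1 = np.sqrt(n) * (np.eye(p) - Omega @ Sigma_hat) @ err   # bias with M = Omega
    I2 = np.sqrt(n) * (Omega - M) @ Sigma_hat @ err           # cost of estimating Omega
    I3 = M @ X.T @ w / np.sqrt(n)                             # Gaussian term given X
    return I1, I2, I3
```

Only $I_3$ involves the noise directly; $I_1$ and $I_2$ vanish when $\hat\theta = \theta^*$, and $I_2$ vanishes when $M = \Omega$.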

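We next provide two bounds on $\|I_2\|_\infty$.

In our first bound, we use duality of the $\ell_\infty$ norm (on $\hat\Sigma(\hat\theta - \theta^*)$) and the $\ell_1$ norm on the rows of $\Omega - M$ as follows:

$$\|I_2\|_\infty \le \sqrt{n}\,\|\Omega - M\|_\infty\,\|\hat\Sigma(\hat\theta - \theta^*)\|_\infty. \tag{90}$$

By the KKT condition for $\hat\theta$, there exists a vector $\xi$ in the subgradient of the $\ell_1$ norm at $\hat\theta$, such that $\hat\Sigma(\hat\theta - \theta^*) = X^T w/n - \lambda\xi$. Therefore,

$$\|\hat\Sigma(\hat\theta - \theta^*)\|_\infty \le \frac{1}{n}\|X^T w\|_\infty + \lambda\|\xi\|_\infty. \tag{91}$$

We have $\|\xi\|_\infty \le 1$ and on the event $\mathcal{B}(n, p)$,

$$\frac{1}{n}\|X^T w\|_\infty \le 2\sigma\sqrt{\frac{\log p}{n}} \le \frac{\lambda}{4}. \tag{92}$$

Using these bounds in Equation (91), we obtain $\|\hat\Sigma(\hat\theta - \theta^*)\|_\infty \le 5\lambda/4$. As proved in [VdGBRD14, Theorem 2.4], we have $\|M - \Omega\|_\infty \lesssim s_\Omega\sqrt{(\log p)/n}$. Combining these bounds in Equation (90) gives our first bound on $I_2$:

$$\|I_2\|_\infty \lesssim \sqrt{n}\cdot s_\Omega\sqrt{\frac{\log p}{n}}\cdot\sqrt{\frac{\log p}{n}} = \frac{s_\Omega\log p}{\sqrt{n}}. \tag{93}$$

To obtain a second bound on $I_2$, we proceed by writing $I_2$ as

$$I_2 = \sqrt{n}\big[(\Omega\hat\Sigma - I) - (M\hat\Sigma - I)\big](\hat\theta - \theta^*). \tag{94}$$

Therefore,

$$\|I_2\|_\infty \le \sqrt{n}\big(|\Omega\hat\Sigma - I|_\infty + |M\hat\Sigma - I|_\infty\big)\|\hat\theta - \theta^*\|_1. \tag{95}$$

On the event $\mathcal{G}_n(A)$ (see Equation (62)), we have $|\Omega\hat\Sigma - I|_\infty \lesssim \sqrt{(\log p)/n}$. Further, for $s_\Omega \ll n/\log p$, we have $|M\hat\Sigma - I|_\infty \lesssim \sqrt{(\log p)/n}$. For the proof of this inequality we refer the reader to [VdGBRD14], Equation (10) and Lemma 5.3 therein. In addition, on the event $\mathcal{B} = \mathcal{B}_\delta(n, s_0, 3) \cap \mathcal{B}(n, p)$ we have $\|\hat\theta - \theta^*\|_1 \lesssim s_0\lambda \asymp s_0\sqrt{(\log p)/n}$. (See e.g. [BvdG11].)

Combining these bounds, we arrive at

$$\|I_2\|_\infty \lesssim \frac{s_0\log p}{\sqrt{n}}. \tag{96}$$

We summarize the bounds given by (93) and (96) as

$$\|I_2\|_\infty \lesssim \min(s_0, s_\Omega)\frac{\log p}{\sqrt{n}}. \tag{97}$$

Finally, note that

$$I_3|X \sim N(0, \sigma^2 M\hat\Sigma M^T).$$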

The result follows by letting $Z = I_3$ and $R = I_1 + I_2$.

A Proof of Lemma 3.5

This proposition is an improved version of Theorem 7.2 in [BRT09]. We first recall the definition of restricted eigenvalues as given by:

$$\phi_{\max}(k) = \max_{1\le\|v\|_0\le k} \frac{\langle v, \hat\Sigma v\rangle}{\|v\|_2^2}.$$

Clearly, $\phi_{\max}(k)$ is a non-decreasing function of $k$.

Employing [Ver12, Remark 5.4], for any $1 \le k \le n$ and a fixed subset $J \subseteq [p]$ with $|J| = k$, we have

$$P\Big( \sigma_{\max}(\hat\Sigma_{J,J}) \ge C_{\max} + C\sqrt{\frac{k}{n}} + \frac{t}{\sqrt{n}} \Big) \le 2e^{-ct^2}$$

for $t \ge 0$, where $C$ and $c$ depend only on $C_{\max}$. Therefore, by a union bound over all possible subsets $J \subseteq [p]$ we obtain

$$P\Big( \phi_{\max}(k) \ge C_{\max} + C\sqrt{\frac{k}{n}} + \frac{t}{\sqrt{n}} \Big) \le 2\binom{p}{k}e^{-ct^2} \le 2e^{-ct^2 + k\log p + k} \tag{98}$$

for $t \ge 0$.
for t > 0. 

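Let $\hat S = \mathrm{supp}(\hat\theta)$. Recall that the stationarity condition for the Lasso cost function reads $X^T(y - X\hat\theta) = n\lambda\,v(\hat\theta)$, where $v(\hat\theta) \in \partial\|\hat\theta\|_1$. Equivalently,

$$\frac{1}{n}X^T X(\theta^* - \hat\theta) = \lambda\,v(\hat\theta) - \frac{1}{n}X^T w.$$

On the event $\mathcal{B}(n, p)$, we have $\|X^T w\|_\infty \le n\lambda/4$. Thus for all $i \in \hat S$

$$\Big| \Big[\frac{1}{n}X^T X(\theta^* - \hat\theta)\Big]_i \Big| \ge \frac{\lambda}{2}.$$

Squaring and summing the last inequality over $i \in \hat S$, we obtain that, for $h = n^{-1/2}X(\theta^* - \hat\theta)$,

$$\frac{\lambda^2}{4}|\hat S| \le \sum_{i\in\hat S}\Big[\frac{1}{n}X^T X(\theta^* - \hat\theta)\Big]_i^2 = \frac{1}{n}\langle h, X_{\hat S}X_{\hat S}^T h\rangle \le \frac{1}{n}\|X_{\hat S}X_{\hat S}^T\|_2\,\|h\|_2^2 \le \phi_{\max}(|\hat S|)\,\|h\|_2^2. \tag{99}$$

By a similar argument as in Corollary 6.2, on the event $\mathcal{B} = \mathcal{B}(n, p) \cap \mathcal{B}_\delta(n, s_0, 3)$ we have

$$\|h\|_2^2 \le \frac{4\lambda^2 s_0}{(1-\delta)^2 C_{\min}}.$$

Thus,

$$|\hat S| \le \frac{16\,\phi_{\max}(|\hat S|)}{(1-\delta)^2 C_{\min}}\,s_0. \tag{100}$$

Note that $|\hat S| \le n$ by the fact that the columns of $X$ are in generic positions. Using the monotonicity property of $\phi_{\max}(\cdot)$, we have $\phi_{\max}(|\hat S|) \le \phi_{\max}(n)$. Invoking equation (98) with $k = n$, we have $\phi_{\max}(n) \le c_1\sqrt{\log p}$ with high probability for some constant $c_1$. Hence, by equation (100)

$$|\hat S| \le C s_0\sqrt{\log p}. \tag{101}$$

Now, we use this bound on $|\hat S|$ along with equation (100) to get a better bound on $|\hat S|$. Again by using the fact that $\phi_{\max}(k)$ is a non-decreasing function of $k$, we have

$$\phi_{\max}(|\hat S|) \le \phi_{\max}(C s_0\sqrt{\log p}) \lesssim C_{\max}, \tag{102}$$

with high probability, where we used the assumption $n \gg s_0(\log p)^2$. Using this bound in equation (100), we get

$$|\hat S| \le \frac{16 C_{\max}}{(1-\delta)^2 C_{\min}}\,s_0.$$

The result follows.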


B Proof of Lemma 3.7 

By definition of the $\ell_\infty$ operator norm, for a symmetric invertible matrix $A$ we have

1 




^"oo = max 


A ^u|| 

v^O ||'?^||oo 


oo 

— = max ■ 


u^o ll-Azil 


oo mm. 


Uf^O 




Note that for any set T C [p] we have 

... \\M\ 


oo . . ||^r,rfi|| 

mm —I—- < mm ■■ ^ ^ ^ , — 

u^O U oo u^O U oo 


whence we obtain 


(103) 




“ - \At.tu\\c 


= IIA, 


-1 II 

j' j'Woo • 


(104) 


Since the above inequality holds for any T C [p], we obtain the desired result. 


C Proof of Lemma 6.4 

For 0 we have 

Cy,x{e*,e) = ^\\y- x^e* - + a||0||i + a|0*| 

Let y = y — XiO*. We then have 

Cy,x{ohe) = ^\\y- + ^ll^lli + ^1^*1 

= Cy,x{ei,eld + - el,)f - ^{y - a^,(0 - 0U)) 

+ A||0||i-A||^J|i (105) 
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( 106 ) 


Since is the minimizer of &) by KKT condition we have 

Applying equation (106) in equation (105) we get 

Cy,x{(^*,9) - Cy,x{01,0l^) = + '^(ll^lli - ll^-lli - - o) 

where the last step follows from the definition of a subgradient. 


D Proof of Lemma 6.5 

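Define $x_{{\rm opt},1} = \arg\min_x f_1(x)$. It is simple to see that $x_{{\rm opt},1} = \eta(a + u_1; \lambda/c_i)$, where $\eta(x;\alpha)$ is the
soft-thresholding function given by

$$\eta(x;\alpha) = \begin{cases} x - \alpha & x > \alpha\,,\\ 0 & |x| \le \alpha\,,\\ x + \alpha & x < -\alpha\,. \end{cases}$$

By substituting for $x_{{\rm opt},1}$ in equation (76) and after some algebraic manipulations, we obtain

$$f_1(x_{{\rm opt},1}) = c_i\,\mathcal{H}(a + u_1; \lambda/c_i) + b_1\,,$$

where $\mathcal{H}(x;\alpha)$ is the Huber function:

$$\mathcal{H}(x;\alpha) = \begin{cases} \alpha|x| - \dfrac{\alpha^2}{2} & \text{if } |x| > \alpha\,,\\[4pt] \dfrac{x^2}{2} & \text{if } |x| \le \alpha\,. \end{cases}$$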

Similarly, setting $x_{{\rm opt},2} = \arg\min_x f_2(x)$ we have

$$f_2(x_{{\rm opt},2}) = c_i\,\mathcal{H}(a + u_2; \lambda/c_i) + b_2\,.$$

Define $\Delta_1 = f_1(a) - f_1(x_{{\rm opt},1})$ and $\Delta_2 = f_2(a) - f_2(x_{{\rm opt},2})$. Substituting for $f_1(a)$ and $f_2(a)$, we
get

$$\Delta_1 = c_i\frac{u_1^2}{2} + \lambda|a| - c_i\,\mathcal{H}(a+u_1;\lambda/c_i)\,, \tag{107}$$

$$\Delta_2 = c_i\frac{u_2^2}{2} + \lambda|a| - c_i\,\mathcal{H}(a+u_2;\lambda/c_i)\,. \tag{108}$$

We then write

$$f_1(a) - f_2(a) = \Delta_1 - \Delta_2 + f_1(x_{{\rm opt},1}) - f_2(x_{{\rm opt},2}) \le \Delta_1 - \Delta_2\,, \tag{109}$$

where we use the assumption $\min_x f_1(x) \le \min_x f_2(x)$.
Finally we bound $\Delta_1 - \Delta_2$ as follows:

$$\Delta_1 - \Delta_2 \le c_i\frac{u_1^2 - u_2^2}{2} + c_i\big|\mathcal{H}(a+u_2;\lambda/c_i) - \mathcal{H}(a+u_1;\lambda/c_i)\big| \le c_i u_2(u_1-u_2) + c_i\frac{(u_1-u_2)^2}{2} + \lambda|u_1 - u_2|\,,$$

where the last inequality holds since $\mathcal{H}'(x;\alpha) = x - \eta(x;\alpha)$, hence $|\mathcal{H}'(x;\alpha)| \le \alpha$, and by the
mean-value theorem.
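The two closed forms used in this proof can be checked numerically. The sketch below assumes that the scalar cost has the form $f(x) = \frac{c}{2}(x - a - u)^2 + \lambda|x| + b$ (this form is inferred from the statements above, not quoted from equation (76)); the helper names `soft_threshold`, `huber`, `f` and `check` and the numerical values are hypothetical. It compares the soft-thresholding minimizer and the Huber-function minimum with a brute-force grid search.

```python
import numpy as np

def soft_threshold(x, alpha):
    """eta(x; alpha): shrink x towards zero by alpha."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def huber(x, alpha):
    """H(x; alpha): quadratic for |x| <= alpha, linear beyond."""
    ax = np.abs(x)
    return np.where(ax <= alpha, 0.5 * ax**2, alpha * ax - 0.5 * alpha**2)

def f(x, a, u, c, lam, b=0.0):
    # Assumed scalar cost: quadratic centred at a + u plus an l1 penalty.
    return 0.5 * c * (x - a - u) ** 2 + lam * np.abs(x) + b

def check(a, u, c, lam):
    """Return the gaps between closed forms and a grid search."""
    grid = np.linspace(-5.0, 5.0, 1_000_001)  # spacing 1e-5
    x_grid = grid[np.argmin(f(grid, a, u, c, lam))]
    x_opt = soft_threshold(a + u, lam / c)
    gap_x = abs(x_grid - x_opt)
    gap_f = abs(f(x_opt, a, u, c, lam) - c * huber(a + u, lam / c))
    return float(gap_x), float(gap_f)

# One case below the threshold and two above it (illustrative values).
results = [check(a, u, 1.3, 0.5) for a, u in [(0.7, -0.4), (0.7, 0.4), (-1.0, -0.8)]]
```

Each entry of `results` holds the distance between the grid minimizer and $\eta(a+u;\lambda/c)$, and the difference between $f$ at that point and $c\,\mathcal{H}(a+u;\lambda/c)$ (with $b = 0$). The same helpers can be used to confirm $|x - \eta(x;\alpha)| \le \alpha$, which is the derivative bound invoked in the last step of the proof.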


E Proof of Lemma 6.6 

To lighten the notation, we drop the subscripts in Ptecall that A(^) = C^y AOhO)- 

£■*■(0). We start by expanding C{9i,9). 

m, 9) = ^\\y- xA - X^i9\\l + X\9*\ + A||0||i. 

Plugging in y = Xi9* + X^i9*^^ + w and rearranging the terms, we obtain 

£(0„ 9) llu; + X^,(9X - e) ||2 + -{9* - 9i, xj(w + X^,{9X _ 0))) 

2 n n 

~ + A|0i| -|- A||0||i. (110) 

2 n 

Therefore, 

m,9) = ^\\^+- ml + m\ + aii^iii . (m) 

Combining equations (110) and (111), we rewrite C{9i,9) as 

Ci9i,9) =C{9*,9) + -{9*-9,Al{w + X^i{9U-e))) 
n 

+ ^\\xi\wn-9i)mx\9i\-x\9*\ 

In 

=X\9i\ + ^\\xif{0i - 9* - + X^i(9X - e))y 

- {xJ{w + X^,i9X - 0))y + Ci9*,9) - A|0*| . (112) 

Writing expression (112) in terms of Cj = ||xj|p/n and u{9), given by (75), we get 

C{9„ 9) = XA\ + %{e^- 9* - u{9)f - %(9f + £(9*, 9) - X\9*\. ( 113 ) 

The result follows.

F Proof of Proposition 6.7

Let T = supp(0^j) U supp(0*). By Lemma 3.5, |T| < where 


Cs — C** + 1 


16 C„ 


(1 - 5)2 Cn 


+ 1 . 


J = J\{*}. For i G Ip] define 

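$$\Sigma_{i|T} = \Sigma_{i,i} - \Sigma_{i,T}(\Sigma_{T,T})^{-1}\Sigma_{T,i}\,.$$

Since $\tilde x_i$ and $X_T$ are jointly Gaussian, we have

$$\tilde x_i = X_T(\Sigma_{T,T})^{-1}\Sigma_{T,i} + \Sigma_{i|T}^{1/2}\, z\,, \tag{114}$$

where $z \in \mathbb{R}^n$ is independent of $X_T$ with i.i.d. standard normal coordinates.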

Recalling the definition of c* = ||xi||2/n and u{9), given by equation (75), we write Ci\u{6'^-)\ as 



xJ{w + X^i{9X-el,)) 
xJiw + XT{9*T-9P)) 


1 


a/2 


< -\xi w\ + -S.|y 


n 

< ^ \r' ?/)l -I- ^ y^/2 


n 

n' 


z^Xt{9*t-91) 

z'^Xt{9*t-91) 


1 

H— 
n 

1 




+ -||Si,T(ST,T)“ii||X?^r(0T - ^r)lloo • 
n 


(115) 


The first inequality here follows from equation (114).

In the following we bound each term on the RHS of equation (115) individually.
On the event $\mathcal{B}(n,p)$, defined by equation (18), we have

$$\frac{1}{n}|\tilde x_i^{\sf T}w| \le \frac{1}{n}\|X^{\sf T}w\|_\infty \le 2\sigma\sqrt{\frac{\log p}{n}}\,. \tag{116}$$


We use Corollary 6.2 to bound the second term of expression (115). We recall the event 
Bs{n, so,3), given by equation (16) and let B = Bs{n, so,3) n B{n,p). Further, recall the nota¬ 
tion Q = Xr^i{9'^^ — 9 X)/^^ = Xt{ 9^ — 9'?p)l^/n and the event T, defined by equation (66). We 
write 


4=5^!|r > A;B) < 


= E 


n 


n *1"^ 

{i(J^sY/iac.i>A).i(a)} 

nA2 1 


< 2E( exp 


2||0 


m 


n 


< 2exp(—c* — 
•So 


(117)
with $c_* = (1-\delta)^2 C_{\min}/8$. Here, the penultimate inequality follows from Fubini's theorem where we
first integrate w.r.t z and then w.r.t Q. Note that and Ci are independent. Therefore, C,i\Ci ~ 
N(0, IlCilP)- In the last step, we applied Corollary 6.2. 

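We next bound the third term on the RHS of equation (115). Note that the KKT conditions for
optimization (59) read

$$\frac{1}{n}X_{\sim i}^{\sf T}\big(w + X_{\sim i}(\theta^*_{\sim i} - \hat\theta_{\sim i})\big) = \lambda\xi\,, \tag{118}$$

for $\xi \in \partial\|\hat\theta_{\sim i}\|_1$. Since $\theta^*_{\sim i} - \hat\theta_{\sim i}$ is supported on $T$, we have $X_{\sim i}(\theta^*_{\sim i} - \hat\theta_{\sim i}) = X_T(\theta^*_T - \hat\theta_T)$. To
lighten the notation, let

$$u = \frac{1}{n}X_T^{\sf T}X_T(\theta^*_T - \hat\theta_T)\,.$$

We know by equation (118),

$$\Big\|u + \frac{1}{n}X_T^{\sf T}w\Big\|_\infty \le \lambda\,.$$

On the event $\mathcal{B}(n,p)$ we have

$$\frac{1}{n}\|X_T^{\sf T}w\|_\infty \le 2\sigma\sqrt{\frac{\log p}{n}} \le \frac{\lambda}{4}\,.$$

Combining the above two inequalities we obtain

$$\|u\|_\infty \le 5\lambda/4\,. \tag{119}$$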


We next employ Condition {iii) to bound ||Sj'^(S t’^t’) ^||i. Define T = T U {i} and write the 
inverse of Sj, j, using Schur complement: 

1 _ / ~^i\T^i,T^T,T 

By Condition (in) and as |r| < C^sq, ||S^hej||i < p. Further, by Condition (i), T,i\T < ^ 1- 

Hence, we get 



P > ||S^hei||i > 1 + ||Si_T(HT,r) ^||i • 


( 120 ) 




Using equations (116) to (120), we bound the RHS of equation (115) as follows. Under the event 

Ci|u(Cj| < ^p. 


G Proof of Corollary 6.8 

Note that $\hat\theta_{\sim i}$ is the Lasso estimator corresponding to $(y, X_{\sim i})$, according to equation (59). As a
corollary of Proposition 3.5, on the event $\mathcal{B}$, $\|\hat\theta_{\sim i}\|_0 \le C^* s_0$, with $C^* = (16\,C_{\max}/C_{\min})(1-\delta)^{-2}$. Also,
$\|\theta^*_{\sim i}\|_0 \le s_0$. Therefore, $(\hat\theta_{\sim i} - \theta^*_{\sim i}) \in \mathcal{C}((C^*+1)s_0, 3)$ and, by definition, on the event $\mathcal{B}_\delta(n, (C^*+1)s_0, 3)$,
the claim holds true.

H Sample splitting techniques


In this appendix, we discuss how sample splitting can be used to modify the debiased estimator so as to
get around the sparsity barrier at $s_0 = o(\sqrt{n}/\log p)$. This provides an alternative to the more careful
analysis carried out in the main body of the paper, that we discuss for the sake of simplicity. As
mentioned in the introduction, sample splitting has its own drawbacks, most notably the dependence
of the results on the random data split, and the sub-optimal use of the samples.

For the sake of notational simplicity we assume here that the number of samples is $2n$ and that the sample is
randomly split into two batches of size $n$: $(x_1, y_1), \dots, (x_n, y_n)$, and $(\tilde x_1, \tilde y_1), \dots, (\tilde x_n, \tilde y_n)$. Note that
the change of notation only amounts to a constant multiplicative factor in the sample size, which is
of no concern to us. In vector notation, these batches are denoted as $(y, X)$ and $(\tilde y, \tilde X)$. We then
proceed as follows:

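1. We use the second batch to compute the Lasso estimator, namely

$$\hat\theta(\tilde y, \tilde X) = \arg\min_{\theta\in\mathbb{R}^p}\Big\{\frac{1}{2n}\|\tilde y - \tilde X\theta\|_2^2 + \lambda\|\theta\|_1\Big\}\,. \tag{121}$$

2. We use the first batch to compute the debiasing matrix $M$, e.g. using the node-wise Lasso as
in Section 3.3.

3. We use the first batch to implement the debiasing, namely

$$\hat\theta^{\rm split} = \hat\theta(\tilde y, \tilde X) + \frac{1}{n}MX^{\sf T}\big(y - X\hat\theta(\tilde y, \tilde X)\big)\,. \tag{122}$$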

The main remark is that, thanks to the splitting, $X$ is statistically independent from $\hat\theta$, which greatly
simplifies the analysis. Notice that we did not use the responses in y. 

For the sake of simplicity, we shall analyze this procedure in the case in which the precision matrix
$\Omega$ is known, and we hence set $M = \Omega$. The generalization to $M$ constructed via the node-wise Lasso
is straightforward, as in the proof of Theorem 3.13.
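The three steps above can be sketched in code for the known-precision case $M = \Omega$. This is a minimal illustration under stated assumptions, not the authors' implementation: the Lasso is solved with a plain proximal-gradient (ISTA) loop written for this example, the names `lasso_ista` and `split_debiased` are hypothetical, the batch computed on by the Lasso is passed as `X_tilde, y_tilde`, and the toy data use $\Sigma = {\rm I}$ so that $\Omega = {\rm I}$.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimise (1/2n)||y - X theta||_2^2 + lam * ||theta||_1 by ISTA."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n
        z = theta - step * grad
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return theta

def split_debiased(X, y, X_tilde, y_tilde, M, lam):
    """Sample-splitting debiasing: Lasso on one batch, correction on the other."""
    theta_hat = lasso_ista(X_tilde, y_tilde, lam)   # step 1: second batch
    n = X.shape[0]
    correction = M @ X.T @ (y - X @ theta_hat) / n  # step 3: first batch
    return theta_hat + correction, theta_hat

# Toy data (hypothetical sizes); identity covariance, so M = Omega = I.
rng = np.random.default_rng(0)
n, p, s0, sigma = 200, 50, 3, 1.0
theta_star = np.zeros(p)
theta_star[:s0] = 2.0
X, X_tilde = rng.standard_normal((n, p)), rng.standard_normal((n, p))
y = X @ theta_star + sigma * rng.standard_normal(n)
y_tilde = X_tilde @ theta_star + sigma * rng.standard_normal(n)
lam = 8 * sigma * np.sqrt(np.log(p) / n)  # regularization level quoted in Proposition H.1
theta_split, theta_hat = split_debiased(X, y, X_tilde, y_tilde, np.eye(p), lam)
```

Because `theta_hat` depends only on the second batch, it is a fixed vector from the point of view of `X` and `y`, which is the independence that the analysis below exploits.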

The next statement implies that, for sparsity level $s_0 = o(n/(\log p)^2)$, the sample splitting debiased
estimator is asymptotically Gaussian.

Proposition H.1. Consider the linear model (2) where $X$ has independent Gaussian rows, with
zero mean and covariance $\Sigma$. Suppose that $\Sigma$ satisfies the technical conditions of Theorem 3.8.

Let $\hat\theta$ be the Lasso estimator defined by (3) with $\lambda = 8\sigma\sqrt{(\log p)/n}$. Further, let $\hat\theta^{\rm split}$ be the modified
(sample-splitting) debiased estimator defined in Eq. (122) with $M = \Omega \equiv \Sigma^{-1}$. Then, there exist
constants $c, C$ depending solely on $C_{\min}$, $C_{\max}$, $\delta$ and $\rho$, such that, for $n \ge c\max(\log p, s_0\log(p/s_0))$
the following holds true:

$$\sqrt{n}(\hat\theta^{\rm split} - \theta^*) = Z + R\,, \qquad Z|X \sim \mathsf{N}(0, \sigma^2\Omega\hat\Sigma\Omega)\,, \tag{123}$$

$$\lim_{n\to\infty}\mathbb{P}\Big(\|R\|_\infty \ge C\sqrt{\frac{s_0}{n}}\,\log p\Big) = 0\,. \tag{124}$$

Proof. Proceeding as in the proof of Theorem 3.8, it is sufficient to bound the bias term of $\sqrt{n}(\hat\theta^{\rm split} - \theta^*)$, which is given by (cf. (8))

$$R = \sqrt{n}(\Omega\hat\Sigma - {\rm I})(\theta^* - \hat\theta)\,. \tag{125}$$

To lighten the notation, let $u = \theta^* - \hat\theta$. Expanding $R$ we get

$$R = \sqrt{n}(\Omega\hat\Sigma - {\rm I})u = \frac{1}{\sqrt{n}}\sum_{i=1}^n(\Omega x_i x_i^{\sf T} - {\rm I})u\,. \tag{126}$$

To control $\|R\|_\infty$, we bound each component $R_j$ individually. Let $e_j$ be the $j$-th element of the
standard basis with one at the $j$-th position and zero everywhere else. We write

$$R_j = \frac{1}{\sqrt{n}}\sum_{i=1}^n\big[(e_j^{\sf T}\Omega x_i)(x_i^{\sf T}u) - u_j\big]\,.$$

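Let $Z_i = (e_j^{\sf T}\Omega x_i)(x_i^{\sf T}u) - u_j$. Note that conditional on $(\tilde y, \tilde X)$, $\hat\theta$ and therefore $u$ are deterministic.
Furthermore, since the first batch $(y, X)$ is independent of $(\tilde y, \tilde X)$, the rows $x_i$ are independent
conditional on $(\tilde y, \tilde X)$. Therefore, $Z_i|(\tilde y, \tilde X)$ are independent with $\mathbb{E}(Z_i|\tilde y, \tilde X) = e_j^{\sf T}\Omega\Sigma u - u_j = 0$. We
let $\|\cdot\|_{\psi_1}$ and $\|\cdot\|_{\psi_2}$ respectively denote the sub-exponential and sub-gaussian norms and condition
on $(\tilde y, \tilde X)$ in the sequel. As shown in [Ver12, Remark 5.18],

$$\|Z_i\|_{\psi_1} \le 2\,\|(e_j^{\sf T}\Omega x_i)(x_i^{\sf T}u)\|_{\psi_1}\,.$$

In addition, for any two random variables $v$ and $w$, we have $\|vw\|_{\psi_1} \le 2\|v\|_{\psi_2}\|w\|_{\psi_2}$. Hence,

$$\|(e_j^{\sf T}\Omega x_i)(x_i^{\sf T}u)\|_{\psi_1} \le 2\|e_j^{\sf T}\Omega x_i\|_{\psi_2}\|x_i^{\sf T}u\|_{\psi_2} = 2\|e_j^{\sf T}\Omega^{1/2}\|_2\,\|\Omega^{1/2}x_i\|_{\psi_2}^2\,\|\Omega^{-1/2}u\|_2 \le 2\sqrt{C_{\max}/C_{\min}}\;\|\Omega^{1/2}x_i\|_{\psi_2}^2\,\|u\|_2\,.$$

Given that $\Omega^{1/2}x_i \sim \mathsf{N}(0, {\rm I})$, we get $\|\Omega^{1/2}x_i\|_{\psi_2} = 1$. Hence, $\max_i\|Z_i\|_{\psi_1} \le C\|u\|_2$ with $C = 4\sqrt{C_{\max}/C_{\min}}$. Applying the Bernstein-type inequality [Ver12, Proposition 5.16], for every $t \ge 0$, we
have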


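$$\mathbb{P}\bigg(\Big|\frac{1}{\sqrt{n}}\sum_{i=1}^n Z_i\Big| \ge t\bigg) \le 2\exp\bigg[-c\min\Big(\frac{t^2}{C^2\|u\|_2^2}, \frac{t\sqrt{n}}{C\|u\|_2}\Big)\bigg]\,, \tag{127}$$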


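where $c > 0$ is an absolute constant. Observe that on the event $\mathcal{B} = \mathcal{B}_\delta(n, s_0, 3) \cap \mathcal{B}(n,p)$,¹ we have

$$\|u\|_2^2 = \|\theta^* - \hat\theta\|_2^2 \lesssim s_0\lambda^2\,.$$

Therefore, by using the tail bound (127) and applying a union bound over the $p$ entries of $R$, we get (for
$n \ge c\log p$ with $c$ a suitable constant)

$$\|R\|_\infty \le C\sqrt{\frac{s_0}{n}}\,\log p\,,$$

with high probability.

□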
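The centering step in this proof, namely that $(e_j^{\sf T}\Omega x_i)(x_i^{\sf T}u) - u_j$ has mean zero when $x_i \sim \mathsf{N}(0,\Sigma)$, $\Omega = \Sigma^{-1}$ and $u$ is held fixed, can be checked with a short Monte Carlo experiment. The covariance, the vector `u`, the coordinate `j` and the sample size below are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_mc, j = 5, 200_000, 2

# Hypothetical well-conditioned covariance and its inverse.
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + np.eye(p)
Omega = np.linalg.inv(Sigma)

# Fixed vector playing the role of theta* - theta_hat (deterministic after conditioning).
u = rng.standard_normal(p)

# Exact identity: e_j^T Omega Sigma u - u_j = 0 up to rounding.
exact = Omega[j] @ Sigma @ u - u[j]

# Monte Carlo estimate of E[Z] with rows x_i ~ N(0, Sigma).
x = rng.multivariate_normal(np.zeros(p), Sigma, size=n_mc)
Z = (x @ Omega[j]) * (x @ u) - u[j]
mean_Z = Z.mean()
std_err = Z.std(ddof=1) / np.sqrt(n_mc)
```

Here `exact` vanishes up to floating-point error, and `mean_Z` should be within a few multiples of `std_err` of zero. The variables `Z` are products of two Gaussians, which is why the proof controls them through the sub-exponential norm rather than the sub-gaussian one.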


¹See Section 3.2 for the definition of $\mathcal{B}_\delta(n, s_0, 3)$ and $\mathcal{B}(n,p)$.

I Proof of Propositions 4.1 and 4.2


I.1 Proof of Proposition 4.1

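Fix $M$ and $\lambda$ for which Equations (45)-(46) hold true and let

$$\hat\theta^d = \hat\theta + \frac{1}{n}MX^{\sf T}(y - X\hat\theta)\,.$$

We construct a confidence interval centered at $\hat\theta^d_i$ as follows:

$$J^d_i = \big[\hat\theta^d_i - \delta(\alpha,n),\ \hat\theta^d_i + \delta(\alpha,n)\big]\,, \tag{128}$$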

6 {a,n) = 4>“^(l-a/2)-— ^ minjav^, \/{I + e)cC} + ^ , (129) 

fl ~ £j\/^ V ^ 

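where $\varepsilon \in (0,1/2)$ is an arbitrary fixed value and $\Phi(x) = \int_{-\infty}^{x} e^{-t^2/2}\,{\rm d}t/\sqrt{2\pi}$ is the Gaussian distribution.

Further, recall that $c$ is the bound on $\sigma^2$ in the definition of $\Gamma(s_0, s_\Omega, \rho)$.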

We have 

£(4) < - a/2) - ^ 4 (1 +e)cC + ^ , (130) 

(1 - e)^n y/n 

and therefore, E..),{t'( J)j)} < (1 + An)/y/n. 

We next show that jf £ ^/^(r). Define the following events: 


81 

= {(1 - e)cr < a < (1 + £)(t}. 

(131) 

82 

= {||Q||oo<G}, 

(132) 

^3 

= {ll-Riloo < A„}. 

(133) 


We further let S = 81 ( 182^1 and Z = e'lMX'^ w / y/n. Since Zj 4n\X ~ N(0, a'^Qiln), we have 


P 




a/2) 


X = 1-a. 


(134) 


By integrating w.r.t X we get the same coverage probability unconditionally. Note that on event 8 , 
^y/Qi < ■y/(1 + e)a‘^C and on r(so, -sri) p), we have a < \/c. Further, o < u/(l — e). Hence, on event 
£: 

(5(a,n) = 4>“^(1 - a/2) ^ ^ 

(1 — £) V n y/n 

> 4>“^(1 — a/2)(T'\/— H—// = 5o(a, n). (135) 

V n Jn 
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We have the following bound on the coverage probability 
p( 0 * E jd) = P(| 0 d -e*i\< 5{a,n)) 

>¥{{\ef-ei\<6o{a,n)}n£) 

>f(^\Z\ < a/2)) - F{£^) 

^ W Tl y Tl / 

F{£^) = P(f) - a , 


(136) 

(137) 

(138) 

(139) 

(140) 


where (a) follows from the decomposition = 0*1 + Zj^/n + R/^/n and the fact that ||-R||oo < 

on £] (b) follows from Equation (134). Since P(£l) —>• 0, we obtain 

liminf inf P.y(0T E jd) > 1 — a. (141) 

n^oo 7er(so,sn,p) 

Therefore, as claimed, 

C(r(so,sn,0)) < E^W4)} < (1 + A„)/V^. (142) 


I.2 Proof of Proposition 4.2

The proof follows the same lines as [CG15] [Theorem 3]. Under the gaussian design model, the data 
pairs {yi,Xi) has a joint gaussian distribution with mean zero and covariance S, where S admits the 
following block decomposition: 

tyy tyA _ + (T^ 0^^^ 

“ V S0 S 



(143) 


where we posit the model y = X6 + w with w ~ N(0,(T^I). (Throughout this section, we simplify 
our notations by writing 0 instead of 0* for the true model parameters.) We also define PSD(p) = 
{M E : M ^ 0}, the set of positive semidefinite matrices of size p. 

Notice that there is a one-to-one map between the parameter space T = {7 = (0,U,(T^) : 0 G 
W,£l E PSD(p), cT^ G M+} and PSD(p-|- 1). Specifically, define the function h : PSD(p-|- 1) i-> T as 
^(U) = {{^xx) ; ? ^yy i^xy)'^ i^xx) ^^xy)- The inverse map h ^ is given by 


h ^((0,P,o-^)) 


70^0-10+ cj2 0Tu-i\ 

V p“^0 ) ' 


(144) 


We next define a null hypothesis Hq and an alternative hypothesis Hi as follows. Let s* = min(so — 
IjSfi) and Si = So — s* > 1. The null space is a singleton Hq = {^ = (0,1, d^)} with 0i = 0, 
Iloilo = Si — 1 and G (0,c]. We further let S = supp(0) and denote by tthq the point mass prior 
on Hq. 

Next we construct the alternative parameter space Hi. First, we define the following set 

Ail', fc) = 1 5 : (5 G , ||7||o = k, 5i £ {0, for 1 < z < pi| , (145) 
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where pi = p — si. We set k = min(s*, {p — 1.01)/i/) where p comes from the constraint ||ri||oo < P 
in the definition of r(so, so) • Later in the proof we enforce some constraints on the value of v and 
in the hindsight, set a suitable value for v that complies with those constraints. 

For a given 5 G , define as follows (here the block decomposition corresponds to decompo¬ 
sition [p] = {1} U S' U {S^ \ 1)): 


/IIPIP+d2 

0 

0 J 

aS'^ \ 

0 

1 

Olxsi 


0 s 

Osi X1 

Isi XSi 

0^1 xpi 

\ aS 

<5 

Opi X Si 

IpiXpi / 


We let : <5 G k)} and construct the alternative space 

Hi = |(0, ; 7 = for some G . 


(146) 


(147) 


We need to show that if G then /i(Il'^) G r(so,SQ,p). Let {9,Q,a‘^) = /i(S^). Then, 


01 


1-||<5P’ 


Gs 


0 s, ^5=\{l} 


{a-01)5. 


(148) 


Therefore, ||0||o = l-|-|5|-|-||(5||o = l-|-(si —l) + s* = sq- Further, if i/ < 1/y^, then ||(5||2 < < 1 

and we have 

= \\0f + a^ - -d(<T-0i)||(5f = < c. (149) 

Finally we note that 

0 -5'^ \ 

(1 ||(5|| )l5]^XSi O^^Xpi (150) 

Opixsi (1 ~ ll'^lP)IpiXpi + 


n = 


1 - 


Hence, maxjgjp] \{j / i,^ij / 0}| = ||(5||o < s* < sn- Further, (H ^)jj = 1 for all i G [p]. Also by 
Weyl’s inequality, if ||(5||2 < < min(C'max-l, 1-Cmin), then Cmm < 0 -min(L;) < Crmax(L:) < Cmax- 

The last condition is on ||H||oo- We have 


||L!||oo < 


l-||hP 


P-0-01 < 

1 - (p- 1.01)i/ “ 


(151) 


where the second inequality is due to the fact that S G A{i',k) and k < {p — 1.01)/j^. The last 
inequality holds if we choose u < . 

Summarizing, (P,H,cj^) G F(so,'So,p) if we choose 

1 / < min|-^(C'max - 1), -^(1 - Clmin), - 7 —-TdTLI • 
t V s* i/s* p(p—1.01)1 


Let TT be the uniform prior on 6 over Aiiy, k) for a fixed u (whose value is to be determined later) and 
denote by tthi the induced prior over Hi. We define fi and /o as the density function of marginal 
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distribution of data {y,x) with priors tthq and respectively. Precisely, for 7 = { 6 ,Q,a‘^) and 
i G {0,1}, we have fi{y,x) = f fj(y,x) 7 ri(dj), where is the induced density on {y,x) for random 
Xi ~ N(0, and noise w ~ N(0, with y = {x, 6 ) + w when we fix the signal 9. 

Applying [CG15, Lemma 1], we have (noting that 61 , 9i are deterministic) 

E^{£{My,X))} > |0i-0i|(i-2a-TV(/i,/o))^, (153) 

where for two density functions TV(/i, /o) = f \fi{z) — fo{z)\dz denotes their total variation distance. 
Also recall the distance between /i and /q: 

It is well known that TV(/i,/o) < fo)- Using [CG15, Lemma 2 ] we have 

x\fufo) + 1 = E,, 5(1 - < E,,-exp(4n<5'^5), (154) 

for 6 and 6 two independent random draws from prior vr over A{i', k). 

By [GG15, Lemma 3] we obtain 

E^^exp(4n(l'''5) < en-*’^1—^ ^ (155) 

We set i' = cy^(logp)/n. Since k < sq < for some constant rj G [0,1/2), by choosing c small 
enough, we can ensure that TV(/ 7 rj^^,/ tt^^) < 1/2 — a. Further, given that s* < so ^ n/logp and p 
is a constant, condition (152) holds true for small enough c. 

Finally, by invoking inequality (153) and substituting for 9i from Equation (148) and 9i = 0, we 
obtain 

E^{l{Ja{y, X))} > >; kv^ = min(pz^, s^i?). (156) 

Note that the inequality (156) implies that (F(so, so)) > min(pz/, s*z/^). Using v x (logp)/n and 
s* = min(so — l,so), we get Ex^{i{Ja{y,X))} > (logp)/n, s*(logp)/n). Proof of the lower 

bound rate '^ly/n follows along the same lines as the proof in [CG15, Theorem 3]. 

It is worth noting that Equation (156) is much stronger than the implied minimax lower bound. 
Indeed it shows that the expected length of confidence intervals at any given point in a large subset 
of F(so,so), namely {(0,1, ct) : ||0||o = si — G (0, c]}, is at least of the provided lower bound 
rate. 


J Proof of Theorem 5.1 and Corollary 5.2 

J.l Proof of Theorem 5.1 

Throughout the proof, we will use 6 = A) to denote the Lasso estimator. Using the KKT 

conditions, it is immediate to see that this satisfies 

0 = r?s(0^) 

e* + -VLX'^w + -^r) , 

n y/n ) 


= 


45 


(157) 

(158) 








with R = — I)(0* — 6 ) defined as in Theorem 3.8. We also define 0° by 

^ = ?7s(^r + ■ 


( 159 ) 


Recall that S = supp(0) is the support of the Lasso estimator. By Proposition 3.5, we have, with 
high probability |5| < for a constant C*. Define = supp(^). Proceeding as in Proposition 

3.5, we obtain, with high probability |S'°| < C*so as well. Letting S = S U S^, we have |5| < 2C*so- 
Write = 6* + n~^QX'^w, r = Rjy/n. By Eq. (158), and the definition of ? 7 s( •), cf. Eq. (49), 
we have 

i||sV 2(0 _ ^0 _ ^)||2 ^ _ ^0 _ ^)||2 ^ ^ 

Expanding the squares on both sides, this can be rewritten as 

l||sV 2(0 _ ^)||2 _ (^,S(0 - P)) < -{(e- ^),S(^ - z^)) + All^lli - A||0||i. (161) 

By the KKT conditions for 6 ^ (which follow from the definition (159), and the definition of ? 7 s), there 
exists a vector v{9^) in the subgradient of the £i norm at 6 ^, such that S(0^ — z^) + Xv{ 6 ^) = 0. 
Hence, by dehnition of subgradient 

i||sV 2(0 _ ^)||2 _ (^,S(0 - ^)) < -A[||0||i - ll^lli - (u(^), (0 - ^))] <0. (162) 


Using the assumption (Tmin(E) > Cmin, we have 

Cnxinll^-^lli <2(Sr, (e-e^)) 

<2\\{j:r)sh\\e-d^h. 


Hence 


||0 - ^lli < ^{Il%r5ll2 + II%,S=^SHI2} 

^min 

<^{CLj5|||r||^ + /i2|5|||r^c||^} 
'^min 

il Tl 


The proof is completed by using Theorem 3.8. 


(163) 

(164) 

(165) 

(166) 
(167) 


J.2 Proof of Corollary 5.2

As in the previous section, we use 9 = A) to denote the Lasso estimator and define 9^ by 

9^ = ri(^9* + ^X'^w,Xy (168) 
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Note that, by Lemma 3.4, we have ||X'''i(;/n||oo < A with high probability, whence 5*^ = supp(0°) C 
S = supp(0*). By triangular inequality and Theorem 5.1, we get 

\\e- e *\\2 = 11 ^ - r 1 I 2 + = ll(^ - 0 *)sl |2 + • (leo) 

We next show that ||(^ — 0*)5|| concentrates around its expectation. Fixing X G define 

F{w, X) = 11^ - 0 JII 2 = \\r]{e* + + ^R)s - e*s 


Noting that the soft-thresholding function $\eta(\,\cdot\,;\lambda)$ is 1-Lipschitz continuous, we have


F(w, X) - F(w'-, X) = 7](9* + -nx'^w + -^R)s - 0*s 

n \/n 


< 


r](e* + -nx'^w' + -^R)s - 0*s 
n wn 


ri(e* +^X^w,x) -r](9* +^X^w'-,x) 

V n / S V n ) i 




< -IIX5II2 ||i« - u ;'||2 ■ 


n 


(170) 


Next, by the Bai-Yin law [AGZ09], we have $\|X_S\|_2 \le 2(\sqrt{s_0} + \sqrt{n})$ with high probability. Therefore,
using $s_0 \le n$, we obtain $F(w;X) - F(w';X) \le 4\|w - w'\|_2/\sqrt{n}$.

Denote by and probability and expectation with respect to w. By Gaussian isoperimetry 
[LedOl], we have P(T(ti;; Y) — K.u){F{w; X)} > t) < , for some universal constant c > 0 . 

This implies E^||(0° — 0 *) 5||2 = E^{||(0^ — 9*)s\\%}^^‘^ + 0{a/^yn), and therefore 

ta 
/n 


ll(^ - r)5||2 < E4||(^ - 9*)s\\IY/^ + ^, 


(171) 


with probability at least 1 — 2e “ . Using this together with Eq. (169), we get 


\\9-9*\\2 = \\^-9*\\2 + Op 


a So logp\ 


n 


a ^, q-S Q logp \ 
n J 


= ytOKYLLS) + Op (^ V 
= ,/ E E{W«.* + n-‘OZj;A)-9*]q+Op(^V 

y iesupp( 0 *) 


(Tso logp^ 


n 


(172) 

(173) 

(174) 


where in the last equality expectation is with respect to Zi ~ N(0, llxjUl/n). The proof is completed 
by using the fact that, with high probability, maxjgjpj — l| < C^y{logp)/n, and bounding 

the resulting error. 

K Proof of Theorem 5.3 

Throughout this proof, we denote by P^ and the probability and the expectation with respect 
to the noise vector w (conditional on Y). Let 


- _ , / 2o-2(DSO)iilog(p/so) 

V 


(175) 
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(176) 

(177) 


and define the estimators 6 ^^'^ by 

fp = r^{e* + ^{nx^w)^-T^), 
e^p = r^{e* + \nx^w)i-T^. 

Throughout this section, denotes a deterministic sequence with —>• oo arbitrarily slow as 

n —>■ oo. First we claim that,, ||0*'^^||o, ||^^^^||o < s^Ln with high probability for any such sequence 
Ln- In other to prove this, recall that S = supp(0*), and consider i ^ S. Conditional on X, we 
have {VLX'^w)i/n ~ N(0, (T^(IlSn)jj/re). Hence, for Z ~ N(0,1) independent of X, and by Wn a 
chi-squared random variable with n degrees of freedom, we get 


/o) =P(|(OXTtf;)i| >nTi) 

= ¥{\{atQ)]pZ\ > ^2^idog{p/sP) 

< 1P(|^| > ^2(1 + <5)-^ log(p/so)) + P((OSfl)ii > (1 + 5)a*) 

/ \ 1 — 5 

- { ?) +HWn>n{l + S)) 

\P J P 


(178) 

(179) 

(180) 
(181) 

(182) 


where the last inequality follows with high probability by taking 6 = Co^/log{p/so)/n, and using 
the assumption that log{p/so)^/n —)■ 0. Hence, by Markov inequality ||0^^^||o < soLn with high 
probability. The claim follows by the same argument for 

We next claim that < sgLn with high probability as well. Indeed, by definition 

9f'> = r](9* + -(OX'^u;)i + , (183) 

\ Tl W Tl / 


Using the fact that ||7?||oo < Co^ so(logp)2/re, with high probability (cf. Theorem 3.8) and proceed¬ 
ing along the same lines as above, we obtain 


^2) ^ 0) < P l^l (Slt^PpZ\ > v'2aHog(p/so) - + p 

so(logp)2log(p/so) 1 , 


<(y) exp<!C 


n 


7 


+ e 


+ 0 ( 1 ) 


< 


P 




(184) 

(185) 


where in the final step we used the assumption so(logp)^/n —>■ 0. Hence, by Markov inequality, we 
have ||0*'^^||o < so+nj with high probability as claimed. 
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Therefore , with high probability, 


||«<"> - e'‘>lb < ^l|fl|U\/ll»W|lo + ll«(‘>llo (186) 

y/n '' 

(187) 

y/n V n n 

Analogously, we have 

< \/||^^^^||o + ll^^^^llo • max|fi - Til (188) 

i£\p\ 

< v/soAnC'max \l ^ ^OgCp/^o) . ^ax | (flSn)ii - fljj | , (189) 

V n ie[p] 

where we used the fact that C~l^ < Clu < is bounded uniformly and \y/x — y/y\ <\x — ?/|/\/4c 
for x,y > c. Since (flSfl)jj/Qjj is distributed as Wnin, for Wn a chi-squared random variable with n 
degrees of freedom, and flu < we have maxjgjpj |(nsn)jj — Qjj| < Cy/ (logp)/n. Substituting 

above, we get 

_ ^1)II 2 < ^2LnCra^.Ca . (190) 

Hence, using triangular inequality together with Equations. (187) and (190), we obtain 

||0(2) _ 0*||2 < _r ||2 + Cc7v^^^^^ (191) 

< ||4'^-^5ll2 + |p5=ll2 + Ca/L;^°^, (192) 

for some constant C > 0 . 

We are left with the task of bounding and pgip- 

• Bounding Fixing X G we let F{w,X) = p^^ — 6 <J|| 2 - Letting af = 

a‘^{flTjfl)ii/n, and denoting by Z ~ N(0,1) a standard Gaussian random variable, we have 

E^{Fiw,Xf} = [n{0* + aiZ-,aiy/2log{p/so)) - 0*]'} (193) 

ies 

< 21og(p/so) (194) 

ieS 

§ {i + c/^} . ( 1 %) 

Here, (a) follows because {flX'^w)/n ~ N(0, cr^^), ( 6 ) because the soft-thresholding risk is maximized 
for 6 * —)> (X)[DJ94, DJ95, DMM09], and (c) because, as remarked above, with high probability we 

have maxjg[p] |(HSH)jj - < Cy/{logp)/n. 

_ 2 

Recall that F denotes the upper bound on the right-hand side of Eq. (195). Let Qo denote the 
set of matrices X for which the bound E^{E(rc;X)^} < F holds. By above argument IP(^o) 1 
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as n,p ^ oo. Now note that, since r]{ - \t) is Lipschitz continuous (with Lipschitz constant equal to 
one), and denoting by d^^\w) the vector dehned in Eq. (177) with noise vector w, we have 

\F{w-X) - F{w'-,X)\ < ||0(i)(u;)s - (196) 

<lc-‘j|Xs|b||»-«,'|b. (197) 


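By the Bai-Yin law, we have $\|X_S\|_2 \le 2(\sqrt{n} + \sqrt{s_0}) \le 4\sqrt{n}$ with high probability (since $s_0 \le n$).
Define $\mathcal{G} = \mathcal{G}_0 \cap \{X \in \mathbb{R}^{n\times p} : \|X_S\|_2 \le 4\sqrt{n}\}$. By Gaussian isoperimetry [Led01], we have, on $\mathcal{G}$,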

F^(^F{w,X) >E^{F{w,X)}+t^ < This implies E{F(u;; Y)} = E + 0 (cj/^). Hence, 

with high probability. 


i(i)i 


E„ 



(198) 

we let af = a‘^{Q,TjQ)ii/n. Denoting by Z ~ 

3 

N(0,1) a standard 

2 } = ^ IEz{r/(criZ;criv/21og(p/so)) | 
i&S<^ 

(199) 

< ^ criIEz{r/(Z; v^21og(p/so))^} 

ieS‘^ 

(200) 

‘1’ c22 W 

(201) 

2 

Trace(DSD). 

(202) 


np 


Here, (a) follows because rj{cx‘,cX) = cr]{x]\) and (6) by a Gaussian integral calculation. As men¬ 
tioned above, (DSD)jj/Djj is distributed as Wnjn for Wn a chi-squared random variable with n 
degrees of freedom. Tail bounds on chi-squared random variables, together with the fact that 
Dii < is bounded uniformly, imply that Trace(DSD) < Cp, with high probability. Hence, 

with high probability with respect to the choice of Y, < Cs^a^jn for some constant 

C > 0. Hence, with high probability 



2 


<aJFF 

V n 


(203) 


The proof is completed by putting together Equations (192), (198), (203) and setting = logp. 


L Proof of Theorem 5.4 

Throughout this section, we let $\hat\theta = \hat\theta(y, X; \lambda)$ denote the Lasso estimator. Define $\hat\theta^0$ by

$$\hat\theta^0 = \eta\Big(\theta^* + \frac{1}{n}X^{\sf T}w; \lambda\Big)\,, \tag{204}$$

where $\eta(\,\cdot\,;\lambda)$ is componentwise soft thresholding, defined for scalars via $\eta(x;\lambda) = (|x| - \lambda)_+\,{\rm sign}(x)$.
Further we denote by $\mathbb{P}_w$ and $\mathbb{E}_w$ probability and expectation with respect to $w$ (conditional on $X$).
Finally, let $\hat S = {\rm supp}(\hat\theta)$, $\hat S^0 = {\rm supp}(\hat\theta^0)$ and $\bar S = \hat S \cup \hat S^0$.

Expanding the square in the definition of R(y,Af), we obtain 


R(y, X, r) - R(y, X) = - (n;, X(0 - 6*)) (205) 

n 

= -{w,x{^-e*))+ -{w,x{e-d^)) ( 206 ) 

n n 

= Ai{w,X,6*) + A 2 {w,X,d*). (207) 


We will separately study the error terms Ai and A 2 . 

We start by considering a preliminary remark. 

Lemma L.l. Let X G have iid entries Xij ~ N(0,1), and define 


Qi{M) = 


X G 


hnxp 


max 

ie\p\ 



n 


< M 



max 


{Xi,Xj) 





(208) 


Then, for M a large enough constant, we have E(A G Qi{M)) >l—p 

Further, under the assumptions of Theorem 5.4, we have P(S C S) > 1 — p 


Proof. The lower bound on P(A G Qi{M)) is standard, and follows from union bound along with 
tail bounds on chi-squared random variables. 

As for the lower bound on P(S C S), using the definition (204) we get that 

P(S° %S)< P(S° 2 S; X G Qi{M)) + ¥{X 0 GiiM)) (209) 

< ^ P(| ^ (XTu;),| > A; X G gi(M)) (210) 

<pPf^|Z| > a)+ p-i°. (211) 

\ Jn J 


where, in the last expression Z ~ N(0,1), and we used maxjg[p] ||xj ||2 < 1-1. The claim then follows 
by a direct calculation. 

In order to bound P(S ^ S) note that, by definition. 


= p(9* — X'^w -\ -A ) . 

n Jn 


(212) 


The proof follows the same lines as above noting that, by Theorem 3.8, ||i2||oo/\/ra < A/100 with 
high probability. □ 

Lemma L.2. Under the assumptions of Theorem 5.4, there exists a constant C such that, with high 
probability 


Cspa"^ / so(logp)3 

n V n 


(213) 
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Proof. We have 


IA 2 I < ^||(xTu,)^||2||0-^||2 (214) 

<l^\\\X^w\U\9-d^h (215) 

< 2ay^nlogp ■ Ci/^||i?||oo , (216) 

n \ n 

where the last inequality follows from Lemma L.l along with the bound (167), for S = I. Using 
Theorem 3.8 we obtain the claim. □ 

Next consider term Ai in the decomposition (207). We first compute its expectation with respect 
to the noise vector w. 

Lemma L.3. Assume X to have i.i.d. rows Xi ~ N(0, S). Then we have, with high probability with 
respect to the choice of X, 


2cr^ 

E^{Ai}- 

n 

Proof. Using Stein’s lemma, we get 

E^{Ai} 



2 P 
i=l j=l 




(217) 


(218) 

(219) 


By differentiating the KKT conditions that follow from the definition of r/s, cf. Eq. (49), we get that 


for y = r/s(^), the following holds true 

||=nfe#0)[(ETT)-‘ST,-],t, (220) 

where T = supp(?/). Recall that 9^ = rjsiz) with z = 9* + n~^QX'^w and = supp(0°). Therefore, 
ago P ago a p .. 

Sit = E ^ ^ (221) 

k' = l ^ k' = l 

1 P P 

= 7 ^ 0 ) ^ [{^soso)~'^so,.]jk'^k'k)Xik ( 222 ) 

k=l k'=l 

= ;•(??"») E (%«.?■ (223) 

fceso 

Substituting in Eq. (219), after some manipulations we get 

2 ij 2 ^ 

E^{Ai} = — E^{Trace((S^05o)“^5]^o gio)} • (224) 
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Using [JM14a, Lemma 6.2], we have |S — I|oo < (logp)/n, with high probability. Hence, 

E^{Trace((Sgoso)“'%o,5o) - |5°|} < C^I^E^{\S^\} . (225) 


The claim follows. 

Lemma L.4. Under the assumptions of Theorem 5.4, the following holds 

.,^2' 

— E^Ai 


>^) <2e-'=*^+o„(l). 
m 


Proof. Define the event 


Q^iM) = g^{M) n{xe < 2 } 


□ 


(226) 


(227) 


Using Lemma L.l, together with standard tail bounds on the singular values of Wishart matrices 
[AGZ09], we get P(X G g 2 {M)) > 1 — p~^ — e~'^^. 

Define the set 

1 


C = \w G MT : —< A; ||rc ||2 < 2na‘^\ . 

I n J 


(228) 


By a union bound argument, it is immediate to see that, for any X G G 2 {M), P(t(; ^ C) < p ® + e 
Further note the following: 

1. C is convex. 

2. For w e C, we have C S. 

3. As a consequence, for rc G C, 


\\o-e*\\i<so\\es-e*s 


* 112 
oo 


< so^A + — IIA'*'7n||oo^ 


(229) 


< 4soA^ 


In order to prove the lemma, we will use Gaussian concentration [LedOl], by proving that w i—>■ 


Ai{w, X,9*) is Lipschitz continuous on C. We have 

^ ?fe, (S» - r)> + A (X’'u.),X., ( 230 ) 

j&SO 

= l(^xie^-e*)). + ^ixp^,x^w)i, (231) 

where P^o £ is the projector onto the indices in 5°. Namely, (Pgo)ij = 0 if i / j, and 

(P 5 o)m = I(* £ S^)- Hence 

llvAill^ < - e*),x'^x{P - e*)) + (232) 

11 i\z ^4110 II z 

< ^((^ - 9*)s, ^ssi^ - 0*)s) + ^\\XP^,X^w\\l (233) 

< ^A^ax(S55)||^ - 0*g + • (234) 

n D UZ 


53 








Next note that 


1 I 

n ' 


XP^oX^W^ = ^llXPgoll" 

(235) 


(236) 

— '^max(^^0 ^o) — '^max(^S',5') • 

(237) 

X £ Q 2 {M), we get 

< -An2ax(S5s)| 11^ - 0*111 + -Amax(Sss) llw^lls) 
n 1 n J 

(238) 

< — (4soA^ + 2fj^) 
n 

(239) 

^ 16/4C«a2logp_^2^,-, 

n \ n ) 

< 

(241) 


n 


Hence, using Gaussian concentration [LedOl] (applied to the Lipschitz extension of Ai from w £ C 
to w ^ C), we get 


P^^|Ai — Med^(Ai)| >t^< P^^|Ai — Medi„(Ai)| > t; w £ c'j +P^(r(; 0 C) (242) 

< +F^{w^C), (243) 

where Med^( •) denotes the median w.r.t the measure The claim follows by bounding |Med^(Ai) — 
E^{Ai}| in the standard way, and using the fact that P(A ^ G 2 {M)), F{w 0 C) —)• 0. □ 

Lemma L.5. Fix X £ Qi{M), and let be any sequence with —)> oo as n ^ oo. Then, we have 

E„{||5”||o}<so + 1, (244) 

IP»(|l|9”llo - E„{||§”||„}| > hifSlAAd) < " . (245) 

Proof. By Lemma L.l, P(^c = 0) > 1 —p~^. We thus get Eu,||^||o < E^||^c||o + so < p-p~^ + so < 

1 + So- Since E{ 05 c = 0} > 1 —in order to prove Eq. (245), it is sufficient to develop a tail bound 
on $\big|\|\hat\theta^0_S\|_0 - \mathbb{E}_w\{\|\hat\theta^0_S\|_0\}\big|$, which we do via Chebyshev's inequality. Letting $T_i = \mathbb{I}(|\theta^*_i + n^{-1}(X^{\sf T} w)_i| > \lambda)$, we have $\|\hat\theta^0_S\|_0 = \sum_{i\in S} T_i$, whence the variance of $\|\hat\theta^0_S\|_0$ is given by


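$${\rm Var}_w(\|\hat\theta^0_S\|_0) = \sum_{i,j\in S} {\rm Cov}_w(T_i, T_j) \tag{246}$$

$$\stackrel{(a)}{\le} \sum_{i,j\in S} \frac{|{\rm Cov}_w((X^{\sf T}w)_i, (X^{\sf T}w)_j)|}{\sqrt{{\rm Var}_w((X^{\sf T}w)_i)\,{\rm Var}_w((X^{\sf T}w)_j)}}\, \sqrt{{\rm Var}(T_i)\,{\rm Var}(T_j)} \tag{247}$$

$$= \sum_{i,j\in S} \frac{|\langle Xe_i, Xe_j\rangle|}{\|Xe_i\|_2\,\|Xe_j\|_2}\, \sqrt{{\rm Var}(T_i)\,{\rm Var}(T_j)} \tag{248}$$

$$\le M\sqrt{\frac{\log p}{n}}\, \Big(\sum_{i\in S}\sqrt{{\rm Var}(T_i)}\Big)^2 \tag{249}$$

$$\le M s_0^2 \sqrt{\frac{\log p}{n}}\,. \tag{250}$$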

Here (a) follows because, for jointly Gaussian random variables $Z_1$, $Z_2$, the correlation coefficient between $f(Z_1)$, $g(Z_2)$ is maximized by linear functions $f$, $g$.

The claim (245) follows from Chebyshev's inequality, using Eq. (250). □
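Step (a) relies on the fact that, for jointly Gaussian variables with correlation $\rho$, no pair of functions of the two variables can be more correlated than $|\rho|$. The following Monte Carlo sketch checks this for threshold indicators of the same form as the $T_i$; the offsets `a`, `b`, the threshold `lam` and the value of `rho` are hypothetical choices, and standard normals stand in for the rescaled coordinates of $X^{\sf T}w$.

```python
import numpy as np

def indicator_correlation(rho=0.3, a=0.8, b=-0.5, lam=1.0,
                          n_samples=200000, seed=0):
    """Empirical correlation of I(|a + Z1| > lam) and I(|b + Z2| > lam),
    where (Z1, Z2) are standard normal with correlation rho."""
    rng = np.random.default_rng(seed)
    z1 = rng.normal(size=n_samples)
    z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.normal(size=n_samples)
    t1 = (np.abs(a + z1) > lam).astype(float)
    t2 = (np.abs(b + z2) > lam).astype(float)
    return float(np.corrcoef(t1, t2)[0, 1])

if __name__ == "__main__":
    rho = 0.3
    corr = indicator_correlation(rho=rho)
    print(f"|corr(T1, T2)| = {abs(corr):.4f}  vs  |rho| = {abs(rho):.4f}")
```

For indicators the gap is typically large, because only the linear (first Hermite) component of each indicator contributes at order $\rho$.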

Lemma L.6. Let $L_n$ be any sequence with $L_n \to \infty$. Then, under the assumptions of Theorem 5.4, we have, with high probability,

$$\big|\|\hat\theta\|_0 - \|\hat\theta^0\|_0\big| \le L_n s_0 \sqrt{\frac{s_0 (\log p)^2}{n}}\,.$$

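Proof. Recall that, by definition,

$$\hat\theta = \eta\Big(\theta^* + \frac{1}{n}X^{\sf T} w + \frac{1}{\sqrt{n}}R;\, \lambda\Big)\,, \qquad \hat\theta^0 = \eta\Big(\theta^* + \frac{1}{n}X^{\sf T} w;\, \lambda\Big)\,.$$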

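Let $\varepsilon_n = C\sqrt{s_0(\log p)^2/n}$ for $C$ a sufficiently large constant, and define the event

$$\mathcal{G}_0 = \big\{\|R\|_\infty \le \sigma\varepsilon_n;\;\; \hat S \subseteq S\big\}\,.$$

By Theorem 3.8 and Lemma L.1, $\mathbb{P}(\mathcal{G}_0) \to 1$ as $n, p \to \infty$. On this event, we have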


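$$\big|\|\hat\theta\|_0 - \|\hat\theta^0\|_0\big| \le \sum_{i\in S} \mathbb{I}\Big(\Big|\theta^*_i + \frac{1}{n}(X^{\sf T}w)_i\Big| \in \Big[\lambda - \frac{1}{\sqrt{n}}\|R\|_\infty,\; \lambda + \frac{1}{\sqrt{n}}\|R\|_\infty\Big]\Big)$$

$$\le \sum_{i\in S} \mathbb{I}\Big(\Big|\theta^*_i + \frac{1}{n}(X^{\sf T}w)_i\Big| \in \Big[\lambda - \frac{\sigma\varepsilon_n}{\sqrt{n}},\; \lambda + \frac{\sigma\varepsilon_n}{\sqrt{n}}\Big]\Big)\,.$$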


(251) 

(252) 

(253) 

(254) 

(255) 

(256) 

(257) 
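We then have, for any sequence $L_n \to \infty$,

$$\mathbb{P}\Big(\big|\|\hat\theta\|_0 - \|\hat\theta^0\|_0\big| \ge L_n s_0 \varepsilon_n\Big) \le \mathbb{P}\Big(\sum_{i\in S} \mathbb{I}\Big(\Big|\theta^*_i + \frac{1}{n}(X^{\sf T}w)_i\Big| \in \Big[\lambda - \frac{\sigma\varepsilon_n}{\sqrt{n}},\; \lambda + \frac{\sigma\varepsilon_n}{\sqrt{n}}\Big]\Big) \ge L_n s_0 \varepsilon_n \,\Big|\, \mathcal{G}_1(M)\Big) + \mathbb{P}(\mathcal{G}_0^c) + \mathbb{P}(\mathcal{G}_1(M)^c)\,. \tag{258}$$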


Using Lemma L.1, it is sufficient to show that the first term vanishes. This can be done by Markov's inequality, bounding the expectation as follows:


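$$\mathbb{E}\Big\{\sum_{i\in S} \mathbb{I}\Big(\Big|\theta^*_i + \frac{1}{n}(X^{\sf T}w)_i\Big| \in \Big[\lambda - \frac{\sigma\varepsilon_n}{\sqrt{n}},\; \lambda + \frac{\sigma\varepsilon_n}{\sqrt{n}}\Big]\Big) \,\Big|\, \mathcal{G}_1(M)\Big\} \tag{259}$$

$$\stackrel{(a)}{\le} 2\sum_{i\in S}\, \sup_{z\in\mathbb{R}}\, \mathbb{E}\Big\{\mathbb{I}\Big(\frac{\sigma\|Xe_i\|_2}{n}\, Z \in \Big[z - \frac{\sigma\varepsilon_n}{\sqrt{n}},\; z + \frac{\sigma\varepsilon_n}{\sqrt{n}}\Big]\Big) \,\Big|\, \mathcal{G}_1(M)\Big\} \tag{260}$$

$$\stackrel{(b)}{\le} 2 s_0\, \sup_{z\in\mathbb{R}}\, \mathbb{P}\big(Z \in [z - 2\varepsilon_n,\, z + 2\varepsilon_n]\big) \tag{261}$$

$$\le C s_0 \varepsilon_n\,, \tag{262}$$

where (a) holds for $Z \sim {\sf N}(0,1)$, and $X \in \mathcal{G}_1(M)$ was used in (b). □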
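The last step of this bound uses the fact that a standard Gaussian puts mass at most $4\varepsilon/\sqrt{2\pi}$ on any interval of half-width $2\varepsilon$, the worst case being the interval centered at zero. A small sketch checking this with the exact Gaussian CDF follows; the grid of centers and the values of $\varepsilon$ are illustrative choices, not quantities from the proof.

```python
import math

def gaussian_interval_prob(z, half_width):
    """P(Z in [z - half_width, z + half_width]) for Z ~ N(0, 1)."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return cdf(z + half_width) - cdf(z - half_width)

def sup_interval_prob(half_width, grid=None):
    """Maximum of the interval probability over a grid of centers z."""
    if grid is None:
        grid = [k / 100.0 for k in range(-500, 501)]
    return max(gaussian_interval_prob(z, half_width) for z in grid)

if __name__ == "__main__":
    for eps in (0.01, 0.1, 0.5):
        sup_p = sup_interval_prob(2.0 * eps)
        bound = 4.0 * eps / math.sqrt(2.0 * math.pi)
        print(f"eps={eps}: sup_z P = {sup_p:.6f} <= {bound:.6f}")
```

The bound is simply the interval length $4\varepsilon$ times the maximum of the standard Gaussian density, so one admissible constant in (262) is $C = 8/\sqrt{2\pi}$ under the reading used in this sketch.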
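Proof of Theorem 5.4. First notice that

$$\Big|\Delta_1 - \frac{2\sigma^2}{n}\|\hat\theta\|_0\Big| \le \Big|\Delta_1 - \frac{2\sigma^2}{n}\|\hat\theta^0\|_0\Big| + \frac{2\sigma^2}{n}\Big|\|\hat\theta^0\|_0 - \|\hat\theta\|_0\Big| \tag{263}$$

$$\le \big|\Delta_1 - \mathbb{E}_w\Delta_1\big| + \Big|\mathbb{E}_w\Delta_1 - \frac{2\sigma^2}{n}\mathbb{E}_w\|\hat\theta^0\|_0\Big| + \frac{2\sigma^2}{n}\Big|\|\hat\theta^0\|_0 - \mathbb{E}_w\|\hat\theta^0\|_0\Big| + \frac{2\sigma^2}{n}\Big|\|\hat\theta^0\|_0 - \|\hat\theta\|_0\Big| \tag{264}$$

$$\stackrel{(a)}{\le} \frac{2C' s_0\sigma^2}{n}\sqrt{\frac{\log p}{n}} + \frac{2t\sigma^2}{\sqrt{n}} + \frac{2L_n s_0\sigma^2}{n}\Big(\frac{\log p}{n}\Big)^{1/4} + \frac{2L_n s_0\sigma^2}{n}\sqrt{\frac{s_0(\log p)^2}{n}} \tag{265}$$

$$\le \frac{2t\sigma^2}{\sqrt{n}} + \frac{6L_n s_0\sigma^2}{n}\Big\{\Big(\frac{\log p}{n}\Big)^{1/4} \vee \Big(\frac{s_0(\log p)^2}{n}\Big)^{1/2}\Big\}\,, \tag{266}$$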


where the inequality (a) holds probability larger than 1 — 0^(1) — 2e by lemmas L.3, L.4, L.5, 
L.6 for any sequence Ln ^ 00 as n ^ 00 . We let 


$$\varepsilon_n \equiv 6L_n\Big\{\Big(\frac{\log p}{n}\Big)^{1/4} \vee \Big(\frac{s_0(\log p)^2}{n}\Big)^{1/2}\Big\}\,.$$

Using the decomposition (207), we have 


R{y,X,e*)-R{y,X)-—\\e\\o < Ai-—Iloilo 

n n 


+ IA 2 


< Cso<t 2 _ /so(logp)2 


n n n 

2tcj2 2e„so0'2 

< , 

In n 


n 


(267) 

(268) 

(269) 

(270) 
where the last inequality holds for all $n$ large enough.

By choosing Ln to be a sequence with slow enough growth rate, e.g. Ln = ( 
En 0. This completes the proof for Gaussian designs. 


— n"' we have 

so{logp)2^ ’ 

□ 